VIDEO-BASED IMAGE CONTROL SYSTEM

TECHNICAL FIELD 

This invention relates to an image processing system, and more particularly to a video-based image control system for processing stereo image data.

BACKGROUND 

A variety of operating systems are currently available for interacting with and controlling a computer system. Many of these operating systems use standardized interfaces based on commonly accepted graphical user interface (GUI) functions and control techniques. As a result, different computer platforms and user applications can be easily controlled by a user who is relatively unfamiliar with the platform and/or application, as the functions and control techniques are generally common from one GUI to another.

One commonly accepted control technique is the use of a mouse or trackball style pointing device to move a cursor over screen objects. An action, such as clicking (single or double) on the object, executes a GUI function. However, for someone who is unfamiliar with operating a computer mouse, selecting GUI functions may present a challenge that prevents them from interfacing with the computer system. There also exist situations where it becomes impractical to provide access to a computer mouse or trackball, such as in front of a department store display window on a city street, or where the user is physically challenged.

SUMMARY

In one general aspect, a method of using stereo vision to interface with a computer is disclosed. The method includes capturing a stereo image and processing the stereo image to determine position information of an object in the stereo image. The object may be controlled by a user. The method further includes using the position information to allow the user to interact with a computer application.

The step of capturing the stereo image may include capturing the stereo image using a stereo camera. The method also may include recognizing a gesture associated with the object by analyzing changes in the position information of the object, and controlling the computer application based on the recognized gesture. The method also may include determining an application state of the computer application and using the application state in recognizing the gesture. The object may be the user. In another instance, the object is a part of the user. The method may include providing feedback to the user relative to the computer application.

In the above implementation, processing the stereo image to determine position information of the object may include mapping the position information from position coordinates associated with the object to screen coordinates associated with the computer application. Processing the stereo image also may include processing the stereo image to identify feature information and produce a scene description from the feature information.

Processing the stereo image also may include analyzing the scene description to identify a change in position of the object and mapping the change in position of the object. Processing the stereo image to produce the scene description also may include processing the stereo image to identify matching pairs of features in the stereo image, and calculating a disparity and a position for each matching feature pair to create a scene description.
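The disparity and position calculation for a matched feature pair can be sketched as follows. This is a minimal illustration only: it assumes rectified cameras, a pinhole model, and a reference camera on the left so that disparity is positive. The names `FeaturePair`, `disparity`, `to_camera_coords`, `focal_px`, `baseline_m`, `cx` and `cy` are hypothetical and do not come from the text above.

```python
from dataclasses import dataclass


@dataclass
class FeaturePair:
    x_ref: float  # column of the feature in the reference image (pixels)
    y_ref: float  # row of the feature in the reference image (pixels)
    x_cmp: float  # column of the matching feature in the comparison image


def disparity(pair):
    """Horizontal offset between the two views (assumed rectified)."""
    return pair.x_ref - pair.x_cmp


def to_camera_coords(pair, focal_px, baseline_m, cx, cy):
    """Return (x, y, z) in metres relative to the reference camera, or None.

    Assumed pinhole relation: z = focal_px * baseline_m / disparity.
    """
    d = disparity(pair)
    if d <= 0:
        return None  # no usable depth for zero or negative disparity
    z = focal_px * baseline_m / d
    x = (pair.x_ref - cx) * z / focal_px
    y = (pair.y_ref - cy) * z / focal_px
    return (x, y, z)
```

Under these assumptions a larger disparity corresponds to an object nearer the cameras, which is why a scene description built from feature pairs carries depth as well as image position.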

The method may include analyzing the scene description in a scene analysis process to determine position information of the object.

Capturing the stereo image may include capturing a reference image from a reference camera and a comparison image from a comparison camera, and processing the stereo image also may include processing the reference image and the comparison image to create pairs of features.

Processing the stereo image to identify matching pairs of features in the stereo image also may include identifying features in the reference image, generating for each feature in the reference image a set of candidate matching features in the comparison image, and producing a feature pair by selecting a best matching feature from the set of candidate matching features for each feature in the reference image. Processing the stereo image also may include filtering the reference image and the comparison image.

Producing the feature pair may include calculating a match score and rank for each of the candidate matching features, and selecting the candidate matching feature with the highest match score to produce the feature pair.

Generating for each feature in the reference image a set of candidate matching features may include selecting candidate matching features from a predefined range in the comparison image.

Feature pairs may be eliminated based upon the match score of the candidate matching feature. Feature pairs also may be eliminated if the match score of the top ranking candidate matching feature is below a predefined threshold. The feature pair may be eliminated if the match score of the top ranking candidate matching feature is within a predefined threshold of the match score of a lower ranking candidate matching feature.

Calculating the match score may include identifying those feature pairs that are neighboring, adjusting the match score of feature pairs in proportion to the match score of neighboring candidate matching features at similar disparity, and selecting the candidate matching feature with the highest adjusted match score to create the feature pair.
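The ranking and elimination rules described above can be illustrated with a short sketch. It covers only the two threshold tests (score too low, and top score too close to a lower-ranked score); the neighbor-based score adjustment is left out. The function name `select_match` and the default values `min_score` and `min_margin` are assumptions chosen for illustration, not values given in the text.

```python
def select_match(candidates, min_score=0.6, min_margin=0.05):
    """Pick the best candidate or return None if the pair should be eliminated.

    candidates: list of (candidate_id, match_score); higher scores are better.
    """
    if not candidates:
        return None
    # Rank candidates from best to worst match score.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    # Eliminate when the top-ranked score is below the threshold.
    if best_score < min_score:
        return None
    # Eliminate when the top score is too close to the runner-up (ambiguous match).
    if len(ranked) > 1 and best_score - ranked[1][1] < min_margin:
        return None
    return best_id
```

The ambiguity test is what keeps repetitive texture from producing confident but wrong pairs: two near-identical scores mean the feature could sit at either disparity.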

Feature pairs may be eliminated by applying the comparison image as the reference image and the reference image as the comparison image to produce a second set of feature pairs, and eliminating those feature pairs in the original set of feature pairs which do not have a corresponding feature pair in the second set of feature pairs.
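This swap-and-compare elimination amounts to a consistency check between two matching passes. A minimal sketch, assuming each pass has already been reduced to a dictionary from a feature identifier in one image to its chosen match in the other (the function name `cross_check` and the dictionary layout are hypothetical):

```python
def cross_check(forward, backward):
    """Keep only pairs confirmed by the reverse matching pass.

    forward:  reference feature -> comparison feature (original pass)
    backward: comparison feature -> reference feature (images swapped)
    """
    return {ref: cmp_ for ref, cmp_ in forward.items()
            if backward.get(cmp_) == ref}
```

A pair survives only when both passes agree, so a feature that is occluded in one view tends to be dropped.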
The method may include, for each feature pair in the scene description, calculating real world coordinates by transforming the disparity and position of each feature pair relative to the real world coordinates of the stereo image. Selecting features may include dividing the reference image and the comparison image of the stereo image into blocks. The feature may be described by a pattern of luminance of the pixels contained within the blocks. Dividing also may include dividing the images into pixel blocks having a fixed size. The pixel blocks may be 8 x 8 pixel blocks.

Analyzing the scene description to determine the position information of the object also may include cropping the scene description to exclude feature information lying outside of a region of interest in a field of view. Cropping may include establishing a boundary of the region of interest.
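As a rough illustration of cropping, the sketch below treats the region of interest as an axis-aligned box in three dimensions. The box shape, the tuple layout of a feature and the function name `crop_to_region` are assumptions; the text does not specify how the boundary is represented.

```python
def crop_to_region(features, region):
    """Keep only features whose (x, y, z) position lies inside the region.

    region: ((x_min, x_max), (y_min, y_max), (z_min, z_max)), an assumed
    axis-aligned boundary for the region of interest.
    """
    (x0, x1), (y0, y1), (z0, z1) = region
    return [f for f in features
            if x0 <= f[0] <= x1 and y0 <= f[1] <= y1 and z0 <= f[2] <= z1]
```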

Analyzing the scene description to determine the position information of the object also may include clustering the feature information in a region of interest into clusters having a collection of features by comparison to neighboring feature information within a predefined range, and calculating a position for each of the clusters. Analyzing the scene description also may include eliminating those clusters having less than a predefined threshold of features.
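One way to picture the clustering step is a simple region-growing pass over the feature positions. The sketch below is an assumption-laden illustration, not the method of the disclosure: it uses Euclidean distance as the "predefined range", takes the mean position as the cluster position, and the names `cluster_features`, `centroid`, `max_dist` and `min_size` are hypothetical.

```python
import math


def cluster_features(points, max_dist, min_size):
    """Group 3D points that lie within max_dist of a neighbor in the same group.

    Clusters with fewer than min_size points are discarded.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        members, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            # Collect unvisited points close enough to the current point.
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= max_dist]
            for j in near:
                unvisited.remove(j)
            members.extend(near)
            frontier.extend(near)
        if len(members) >= min_size:
            clusters.append([points[k] for k in members])
    return clusters


def centroid(cluster):
    """Mean position of a cluster, used here as the cluster position."""
    n = len(cluster)
    return tuple(sum(p[axis] for p in cluster) / n for axis in range(3))
```

Discarding small clusters removes isolated feature pairs, which are more likely to be matching errors than part of a hand or body.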
Analyzing the scene description also may include selecting the position of the clusters that match a predefined criteria, recording the position of the clusters that match the predefined criteria as object position coordinates, and outputting the object position coordinates. The method also may include determining the presence of a user from the clusters by checking features within a presence detection region. Calculating the position for each of the clusters may exclude those features in the clusters that are outside of an object detection region.

The method may include defining a dynamic object detection region based on the object position coordinates. Additionally, the dynamic object detection region may be defined relative to a user's body.

The method may include defining a body position detection region based on the object position coordinates. Defining the body position detection region also may include detecting a head position of the user. The method also may include smoothing the motion of the object position coordinates to eliminate jitter between consecutive image frames.
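A minimal sketch of frame-to-frame smoothing follows. It assumes a damping factor that is strong for small movements (likely jitter) and falls to zero for large movements (likely deliberate motion). The linear ramp, the numeric defaults and the names `damping` and `smooth` are illustrative assumptions; the actual damping curve is not given in this section.

```python
import math


def damping(distance, d_low=0.01, d_high=0.10, s_max=0.9):
    """Hypothetical damping curve: heavy damping for small moves, none for large."""
    if distance <= d_low:
        return s_max
    if distance >= d_high:
        return 0.0
    return s_max * (d_high - distance) / (d_high - d_low)


def smooth(previous, raw):
    """Blend the previous smoothed position with the new raw position."""
    s = damping(math.dist(previous, raw))
    return tuple(s * p + (1.0 - s) * r for p, r in zip(previous, raw))
```

Making the damping depend on distance avoids the lag a fixed filter would add to fast, intentional gestures.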

The method may include calculating hand orientation information from the object position coordinates. Outputting the object position coordinates may include outputting the hand orientation information. Calculating hand orientation information also may include smoothing the changes in the hand orientation information.

Defining the dynamic object detection region also may include identifying a position of a torso-divisioning plane from the collection of features, and determining the position of a hand detection region relative to the torso-divisioning plane in the axis perpendicular to the torso-divisioning plane.

Defining the dynamic object detection region may include identifying a body center position and a body boundary position from the collection of features, identifying a position indicating part of an arm of the user from the collection of features using the intersection of the feature pair cluster with the torso-divisioning plane, and identifying the arm as either a left arm or a right arm using the arm position relative to the body position.

This method also may include establishing a shoulder position from the body center position, the body boundary position, the torso-divisioning plane, and the left arm or the right arm identification. Defining the dynamic object detection region may include determining position data for the hand detection region relative to the shoulder position.

This technique may include smoothing the position data for the hand detection region. Additionally, this technique may include determining the position of the dynamic object detection region relative to the torso-divisioning plane in the axis perpendicular to the torso-divisioning plane, determining the position of the dynamic object detection region in the horizontal axis relative to the shoulder position, and determining the position of the dynamic object detection region in the vertical axis relative to an overall height of the user using the body boundary position.

WO 02/07839    PCT/US01/23224

Defining the dynamic object detection region may include establishing the position of a top of the user's head using topmost feature pairs of the collection of features unless the topmost feature pairs are at the boundary, and determining the position of a hand detection region relative to the top of the user's head.

In another aspect, a method of using stereo vision to interface with a computer is disclosed. The method includes capturing a stereo image using a stereo camera, and processing the stereo image to determine position information of an object in the stereo image, wherein the object is controlled by a user. The method further includes processing the stereo image to identify feature information, to produce a scene description from the feature information, and to identify matching pairs of features in the stereo image. The method also includes calculating a disparity and a position for each matching feature pair to create the scene description, and analyzing the scene description in a scene analysis process to determine position information of the object. The method may include clustering the feature information in a region of interest into clusters having a collection of features by comparison to neighboring feature information within a predefined range, calculating a position for each of the clusters, and using the position information to allow the user to interact with a computer application.

Additionally, this technique may include mapping the position of the object from the feature information from camera coordinates to screen coordinates associated with the computer application, and using the mapped position to interface with the computer application.

The method may include recognizing a gesture associated with the object by analyzing changes in the position information of the object in the scene description, and combining the position information and the gesture to interface with the computer application. The step of capturing the stereo image may include capturing the stereo image using a stereo camera.

In another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. The stereo vision system includes first and second video cameras arranged in an adjacent configuration and operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to define an object detection region in three-dimensional coordinates relative to a position of the first and second video cameras, select a control object appearing within the object detection region, and map position coordinates of the control object to a position indicator associated with the application program as the control object moves within the object detection region.
The process may select as a control object a detected object appearing closest to the video cameras and within the object detection region. The control object may be a human hand.

A horizontal position of the control object relative to the video cameras may be mapped to an x-axis screen coordinate of the position indicator. A vertical position of the control object relative to the video cameras may be mapped to a y-axis screen coordinate of the position indicator.
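The horizontal and vertical mapping can be sketched as a linear rescaling from the detection region to the screen. This is only an illustration under stated assumptions: the region is an axis-aligned rectangle in camera units, the camera's vertical axis points up while screen y grows downward, and positions outside the region are clamped to the screen edge. The name `map_to_screen` and its parameters are hypothetical.

```python
def map_to_screen(x, y, region, screen_w, screen_h):
    """Map a camera-space (x, y) position to integer screen pixel coordinates.

    region: (x_min, x_max, y_min, y_max) of the detection region (assumed).
    """
    x_min, x_max, y_min, y_max = region
    u = (x - x_min) / (x_max - x_min)
    v = (y_max - y) / (y_max - y_min)  # flip so that "up" maps to the top row
    # Clamp so the position indicator never leaves the screen.
    u = min(max(u, 0.0), 1.0)
    v = min(max(v, 0.0), 1.0)
    return round(u * (screen_w - 1)), round(v * (screen_h - 1))
```

A small detection region mapped onto a large screen amplifies hand motion, so the region size effectively sets pointer sensitivity.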

The processor may be configured to map a horizontal position of the control object relative to the video cameras to an x-axis screen coordinate of the position indicator, map a vertical position of the control object relative to the video cameras to a y-axis screen coordinate of the position indicator, and emulate a mouse function using the combined x-axis and y-axis screen coordinates provided to the application program.

The processor may be configured to emulate buttons of a mouse using gestures derived from the motion of the object position. The processor may be configured to emulate buttons of a mouse based upon a sustained position of the control object in any position within the object detection region for a predetermined time period. In other instances, the processor may be configured to emulate buttons of a mouse based upon a position of the position indicator being sustained within the bounds of an interactive display region for a predetermined time period. The processor may be configured to map a z-axis depth position of the control object relative to the video cameras to a virtual z-axis screen coordinate of the position indicator.
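Button emulation from a sustained position can be pictured as a dwell timer. The sketch below is a hedged illustration: the class name `DwellClicker`, the `tolerance` radius that defines "sustained", and the once-per-dwell behavior are all assumptions added for the example.

```python
import math


class DwellClicker:
    """Reports a click when the position stays put for hold_seconds."""

    def __init__(self, hold_seconds=1.0, tolerance=0.02):
        self.hold_seconds = hold_seconds
        self.tolerance = tolerance  # how far the hand may drift and still count
        self._anchor = None
        self._since = None
        self._fired = False

    def update(self, position, now):
        """Feed one position sample; returns True once per completed dwell."""
        moved = (self._anchor is None
                 or math.dist(position, self._anchor) > self.tolerance)
        if moved:
            # Restart the timer at the new resting position.
            self._anchor, self._since, self._fired = position, now, False
            return False
        if not self._fired and now - self._since >= self.hold_seconds:
            self._fired = True
            return True
        return False
```

Firing only once per dwell prevents a hand held still from generating a stream of repeated clicks.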

The processor may be configured to map an x-axis position of the control object relative to the video cameras to an x-axis screen coordinate of the position indicator, map a y-axis position of the control object relative to the video cameras to a y-axis screen coordinate of the position indicator, and map a z-axis depth position of the control object relative to the video cameras to a virtual z-axis screen coordinate of the position indicator.

A position of the position indicator being within the bounds of an interactive display region may trigger an action within the application program. Movement of the control object along a z-axis depth position that covers a predetermined distance within a predetermined time period may trigger a selection action within the application program. A position of the control object being sustained in any position within the object detection region for a predetermined time period may trigger part of a selection action within the application program.

In another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. The stereo vision system includes first and second video cameras arranged in an adjacent configuration and operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in the intersecting field of view of the cameras. The processor executes a process to define an object detection region in three-dimensional coordinates relative to a position of the first and second video cameras, select as a control object a detected object appearing closest to the video cameras and within the object detection region, define sub regions within the object detection region, identify a sub region occupied by the control object, associate with that sub region an action that is activated when the control object occupies that sub region, and apply the action to interface with a computer application.

The action associated with the sub region is further defined to be an emulation of the activation of keys associated with a computer keyboard. A position of the control object being sustained in any sub region for a predetermined time period may trigger the action.
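As a rough sketch of sub regions tied to keyboard actions, the example below looks up which box the control object occupies and returns an associated key. The box representation, the region names and the key names in `KEY_FOR_REGION` are hypothetical placeholders, and the dwell-time requirement is omitted for brevity.

```python
def find_sub_region(position, sub_regions):
    """Return the name of the first sub region containing position, else None.

    sub_regions: dict of name -> ((x0, x1), (y0, y1), (z0, z1)) (assumed layout).
    """
    for name, bounds in sub_regions.items():
        if all(lo <= p <= hi for p, (lo, hi) in zip(position, bounds)):
            return name
    return None


# Hypothetical association between sub regions and emulated keys.
KEY_FOR_REGION = {"left": "ArrowLeft", "right": "ArrowRight"}


def key_for_position(position, sub_regions):
    """Key to emulate for the occupied sub region, or None if there is none."""
    return KEY_FOR_REGION.get(find_sub_region(position, sub_regions))
```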

In yet another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. First and second video cameras are arranged in an adjacent configuration and are operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to identify an object perceived as the largest object appearing in the intersecting field of view of the cameras and positioned at a predetermined depth range, select the object as an object of interest, determine a position coordinate representing a position of the object of interest, and use the position coordinate as an object control point to control the application program.

The process also may cause the processor to determine and store a neutral control point position, map a coordinate of the object control point relative to the neutral control point position, and use the mapped object control point coordinate to control the application program.

The process may cause the processor to define a region having a position based upon the position of the neutral control point position, map the object control point relative to its position within the region, and use the mapped object control point coordinate to control the application program. The process also may cause the processor to transform the mapped object control point to a velocity function, determine a viewpoint associated with a virtual environment of the application program, and use the velocity function to move the viewpoint within the virtual environment.
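One possible reading of the velocity function is a joystick-like mapping: the offset of the control point from the neutral position sets a speed, with no motion near neutral. The sketch below is an assumption for illustration; the dead-zone width, the linear ramp, the saturation at `max_speed` and the function names are not specified in the text.

```python
def velocity_from_offset(offset, dead_zone=0.05, full_scale=0.30, max_speed=1.0):
    """Map a signed offset from the neutral position to a signed speed."""
    magnitude = abs(offset)
    if magnitude <= dead_zone:
        return 0.0  # near the neutral position: no implied motion
    scaled = min((magnitude - dead_zone) / (full_scale - dead_zone), 1.0)
    return max_speed * scaled if offset > 0 else -max_speed * scaled


def move_viewpoint(viewpoint, control_point, neutral_point, dt):
    """Advance the viewpoint by the per-axis velocity over a time step dt."""
    return tuple(v + velocity_from_offset(c - n) * dt
                 for v, c, n in zip(viewpoint, control_point, neutral_point))
```

Because the control point sets velocity rather than position, a user can keep the viewpoint moving through a large virtual environment while standing within a small physical area.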
The process may cause the processor to map the object control point to control a position of an indicator within the application program. In this implementation the indicator may be an avatar.

The process may cause the processor to map the object control point to control an appearance of an indicator within the application program. In this implementation the indicator may be an avatar. The object of interest may be a human appearing within the intersecting field of view.

In another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. The stereo vision system includes first and second video cameras arranged in an adjacent configuration and operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to identify an object perceived as the largest object appearing in the intersecting field of view of the cameras and positioned at a predetermined depth range, select the object as an object of interest, define a control region between the cameras and the object of interest, the control region being positioned at a predetermined location and having a predetermined size relative to a size and a location of the object of interest, search the control region for a point associated with the object of interest that is closest to the cameras and within the control region, select the point associated with the object of interest as a control point if the point associated with the object of interest is within the control region, and map position coordinates of the control point, as the control point moves within the control region, to a position indicator associated with the application program.

The processor may be operable to map a horizontal position of the control point relative to the video cameras to an x-axis screen coordinate of the position indicator, map a vertical position of the control point relative to the video cameras to a y-axis screen coordinate of the position indicator, and emulate a mouse function using a combination of the x-axis and the y-axis screen coordinates.

Alternatively, the processor also may be operable to map an x-axis position of the control point relative to the video cameras to an x-axis screen coordinate of the position indicator, map a y-axis position of the control point relative to the video cameras to a y-axis screen coordinate of the position indicator, and map a z-axis depth position of the control point relative to the video cameras to a virtual z-axis screen coordinate of the position indicator.

In the stereo vision system, the object of interest may be a human appearing within the intersecting field of view. Additionally, the control point may be associated with a human hand appearing within the control region.

In yet another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. First and second video cameras are arranged in an adjacent configuration and are operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to define an object detection region in three-dimensional coordinates relative to a position of the first and second video cameras, select up to two hand objects from the objects appearing in the intersecting field of view that are within the object detection region, and map position coordinates of the hand objects, as the hand objects move within the object detection region, to positions of virtual hands associated with an avatar rendered by the application program.

The process may select the up to two hand objects from the objects appearing in the intersecting field of view that are closest to the video cameras and within the object detection region. The avatar may take the form of a human-like body. Additionally, the avatar may be rendered in and interact with a virtual environment forming part of the application program. The processor may execute a process to compare the positions of the virtual hands associated with the avatar to positions of virtual objects within the virtual environment to enable a user to interact with the virtual objects within the virtual environment.

The processor also may execute a process to detect position coordinates of a user within the intersecting field of view, and map the position coordinates of the user to a virtual torso of the avatar rendered by the application program. The process may move at least one of the virtual hands associated with the avatar to a neutral position if a corresponding hand object is not selected.

The processor also may execute a process to detect position coordinates of a user within the intersecting field of view, and map the position coordinates of the user to a velocity function that is applied to the avatar to enable the avatar to roam through a virtual environment rendered by the application program. The velocity function may include a neutral position denoting zero velocity of the avatar. The processor also may execute a process to map the position coordinates of the user relative to the neutral position into torso coordinates associated with the avatar so that the avatar appears to lean.

The processor also may execute a process to compare the position of the virtual hands associated with the avatar to positions of virtual objects within the virtual environment to enable the user to interact with the virtual objects while roaming through the virtual environment.

As part of the stereo vision system, a virtual knee position associated with the avatar may be derived by the application program and used to refine an appearance of the avatar. Additionally, a virtual elbow position associated with the avatar may be derived by the application program and used to refine an appearance of the avatar.

The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

Fig. 1 shows the hardware components and environment of a typical implementation of a video-based image control system.

Fig. 2 is a flow diagram generally describing the processing technique employed by the system of Fig. 1.

Fig. 3 is a diagram showing the field of view of each camera associated with the video-based image control system of Fig. 1.

Fig. 4 shows a common point of interest and epipolar lines appearing in two video images produced by a stereo camera device.

Fig. 5 is a flow diagram showing a stereo processing routine used to produce scene description information from stereo images.

Fig. 6 is a flow diagram showing a process for transforming scene description information into position and orientation data.

Fig. 7 is a graph showing the degree of damping S as a function of distance D expressed in terms of change in position.

Fig. 8 shows an implementation of the image control system in which an object or hand detection region is established directly in front of a computer monitor screen.

Fig. 9 is a flow diagram showing an optional process of dynamically defining a hand detection region relative to a user's body.

Figs. 10A-10C illustrate examples of the process of Fig. 9 for dynamically defining the hand detection region relative to the user's body.

Fig. 11A shows an exemplary user interface and display region associated with the video-based image control system.

Fig. 11B shows a technique for mapping a hand or pointer position to a display region associated with the user interface.

Fig. 12A illustrates an exemplary three-dimensional user interface represented in a virtual reality environment.

Fig. 12B illustrates the three-dimensional user interface of Fig. 12A in which contents of a virtual file folder have been removed for viewing.

Fig. 13A illustrates an exemplary representation of a three-dimensional user interface for navigating through a virtual three-dimensional room.

Fig. 13B is a graph showing coordinate regions which are represented in the image control system as dead zones, in which there is no implied change in virtual position.

Fig. 14 shows an exemplary implementation of a video game interface in which motions and gestures are interpreted as joystick type navigation control functions for flying through a virtual three-dimensional cityscape.

Fig. 15A is a diagram showing an exemplary hand detection region divided into detection planes.

Fig. 15B is a diagram showing an exemplary hand detection region divided into detection boxes.

Figs. 15C and 15D are diagrams showing an exemplary hand detection region divided into two sets of direction detection boxes, and further show a gap defined between adjacent direction detection boxes.

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

Fig. 1 shows one implementation of a video-based image control system 100. A person (or multiple people) 101 locates himself or herself in, or reaches with his or her hand or hands into, a region of interest 102. The region of interest 102 is positioned relative to an image detector 103 so as to be in the overall field of view 104 of the image detector. The region of interest 102 contains a hand detection region 105 within which parts of the person's body, if present and detectable, are located and their positions and motions measured. The regions, positions and measures are expressed in a three-dimensional x, y, z coordinate or world-coordinate system 106 which does not need to be aligned to the image detector 103. A series of video images generated by the image detector 103 are processed by a computing apparatus 107, such as a personal computer, capable of displaying a video image on a video display 108.

As will be described in greater detail below, the computing apparatus 107 processes the series of video images in order to analyze the position and gestures of an object such as the user's hand. The resulting position and gesture information then is mapped into an application program, such as a graphical user interface (GUI) or a video game. A representation of the position and gestures of the user's hand (such as a screen pointer or cursor) is presented on the video display 108 and allows functions within the GUI or video game to be executed and/or controlled. An exemplary function is moving the cursor over a screen button and receiving a "click" or "press" gesture to select the screen button. The function associated with the button may then be executed by the computing apparatus 107. The image detector 103 is described in greater detail below.

System 100 may be implemented in a variety of configurations including a desktop configuration where the image detector 103 is mounted on a top surface of the video display 108 for viewing the region of interest 102, or alternatively an overhead camera configuration where the image detector 103 is mounted on a support structure and positioned above the video display 108 for viewing the region of interest 102.

Fig. 2 shows the video image analysis process 200, which may be implemented
through computer software or alternatively computer hardware, in an
implementation of the system 100. The image detector or video camera 103 acquires
stereo images 201 of the region of interest 102 and the surrounding scene. These
stereo images 201 are conveyed to the computing apparatus 107 (which may
optionally be incorporated into the image detector 103), which performs a stereo
analysis process 202 on the stereo images 201 to produce a scene description 203.
From the scene description 203, computing apparatus 107, or a different computing
device, uses a scene analysis process 204 to calculate and output hand/object
position information 205 of the person's (or people's) hand(s) or other suitable
pointing device and optionally the positions or measures of other features of the
person's body. The hand/object position information 205 is a set of
three-dimensional coordinates that are provided to a position mapping process 207
that maps or transforms the three-dimensional coordinates to screen coordinates.
These screen coordinates produced by the position mapping process 207 can then be
used as screen coordinate position information by an application program 208 that
runs on the computing apparatus 107 and provides user feedback 206.

Certain motions made by the hand(s), which are detected as changes in the
position of the hand(s) and/or other features represented as the hand/object
position information 205, may also be detected and interpreted by a gesture
analysis and detection process 209 as gesture information or gestures 211. The
screen coordinate position information from the position mapping process 207
along with the gesture information 211 is then communicated to, and used to
control, the application program 208.

The detection of gestures may be context sensitive, in which case an application
state 210 may be used by the gesture detection process 209, and the criteria and
meaning of gestures may be selected by the application program 208. An example of
an application state 210 is a condition where the appearance of the cursor changes
depending upon its displayed location on the video screen 108. Thus, if the user
moves the cursor from one screen object to a different screen object, the icon
representing the cursor may for example change from a pointer icon to a hand icon.
Typically, the user receives feedback 206 as changes in the image presented on the
video display 108. In general, the feedback 206 is provided by the application
program 208 and pertains to the hand position and the state of the application on
the video display 108.

The image detector 103 and the computing device 107 produce scene description
information 203 that includes a three-dimensional position, or information from
which the three-dimensional position is implied, for all or some subset of the
objects or parts of the objects that make up the scene. Objects detected by the
stereo cameras within the image detector 103 may be excluded from consideration if
their positions lie outside the region of interest 102, or if they have shape or
other qualities inconsistent with a person in a pose consistent with the typical
use of the system 100. As a result, few limitations are imposed on the environment
in which the system may operate. The environment may even contain additional
people who are not interacting with the system. This is a unique aspect of the
system 100 relative to other tracking systems that require that the parts of the
image(s) that do not make up the user, that is the background, be static and/or
modeled.

Also, few limitations are imposed on the appearance of the user and hand, as it is
the general three-dimensional shape of the person and arm that is used to identify
the hand. The user 101 may even wear a glove or mitten while operating the system
100. This is also a unique aspect of system 100, as compared to other tracking
systems that make use of the appearance of the hand, most commonly skin color, to
identify the hand. Thus, system 100 can be considered more robust than methods
relying on the appearance of the user and hand, because the appearance of bodies
and hands is highly variable among poses and different people. However, it should
be noted that appearance may be used in some implementations of the stereo
analysis process 202 that are compatible with the system 100.

Typically, the scene description information 203 is produced through the use of
stereo cameras. In such a system, the image detector 103 consists of two or more
individual cameras and is referred to as a stereo camera head. The cameras may be
black and white video cameras or may alternatively be color video cameras. Each
individual camera acquires an image of the scene from a unique viewpoint and
produces a series of video images. Using the relative positions of parts of the
scene in each camera image, the computing device 107 can infer the distance of the
object from the image detector 103, as desired for the scene description 203.

An implementation of a stereo camera image detector 103 that has been used for
this system is described in greater detail below. Other stereo camera systems and
algorithms exist that produce a scene description suitable for this system, and it
should be understood that it is not intended that this system be limited to using
the particular stereo system described herein.

Turning to Fig. 3, each camera 301, 302 of the image detector or stereo camera
head 103 detects and produces an image of the scene that is within that camera's
field of view 304, 305 (respectively). The overall field of view 104 is defined as
the intersection of all the individual fields of view 304, 305. Objects 307 within
the overall field of view 104 have the potential to be detected, as a whole or in
parts, by all the cameras 301, 302. The objects 307 may not necessarily lie within
the region of interest 102. This is permissible because the scene description 203
is permitted to contain objects, or features of objects, that are outside the
region of interest 102. With respect to Fig. 3, it should be noted that the hand
detection region 105 is a subset of the region of interest 102.

With respect to Fig. 4, each image 401 and 402 of the pair of images 201 is
detected by the pair of cameras 103. There exists a set of lines in the image 401,
such that for each line 403 of that set, there exists a corresponding line 404 in
the other image 402. Further, any common point 405 in the scene that is located on
the line 403 will also be located on the corresponding line 404 in the second
camera image 402, so long as that point is within the overall field of view 104
and visible by both cameras 301, 302 (for example, not occluded by another object
in the scene). These lines 403, 404 are referred to as epipolar lines. The
difference in position of the point on each epipolar line of the pair is referred
to as disparity. Disparity is inversely proportional to distance, and therefore
provides information required to produce the scene description 203.
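The inverse relation between disparity and distance can be illustrated with a short sketch. It assumes an idealized, rectified pinhole stereo pair, for which distance equals focal length times baseline divided by disparity. This is the textbook relation, not necessarily the transformation used by the described system, and the function name, focal length, and baseline below are hypothetical values chosen for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Distance along the optical axis for a rectified stereo pair.

    Assumes an idealized pinhole model: Z = f * B / d.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite distance")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical camera: 500 px focal length, 0.1 m baseline.
    for d in (50, 25, 10):
        print(d, "px ->", depth_from_disparity(d, 500.0, 0.1), "m")
```

Halving the disparity doubles the computed distance, which is why a one pixel disparity error matters far more for distant objects than for near ones.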

The epipolar line pairs are dependent on the distortion in the cameras' images and
the geometric relationship between the cameras 301, 302. These properties are
determined and optionally analyzed through a pre-process referred to as
calibration. The system must account for the radial distortion introduced by the
lenses used on most cameras. One technique for resolving those camera
characteristics, including radial distortion, is presented in Z. Zhang, A Flexible
New Technique for Camera Calibration, Microsoft Research,
http://research.microsoft.com/~zhang, which is incorporated by reference, and may
be used as the first step of calibration. This technique will not find the
epipolar lines, but it causes the lines to be straight, which simplifies finding
them. A subset of the methods described in Z. Zhang, Determining the Epipolar
Geometry and its Uncertainty: A Review, The International Journal of Computer
Vision 1997, and Z. Zhang, Determining the Epipolar Geometry and its Uncertainty:
A Review, Technical Report 2927, INRIA Sophia Antipolis, France, July 1996, both
of which are incorporated by reference, may be applied to solve the epipolar
lines, as the second step of calibration.

One implementation of a stereo analysis process 202 that has been used to produce
the scene description 203 is described in Fig. 5. The image pair 201 includes a
reference image 401 and a comparison image 402. Individual images 401 and 402 are
filtered by an image filter 503 and broken into features at block 504. Each
feature is represented as an 8 x 8 block of pixels. However, it should be
understood that the features may be defined in pixel blocks that are larger or
smaller than 8 x 8 and processed accordingly.

A matching process 505 seeks a match for each feature in the reference image. To
this end, a feature comparison process 506 compares each feature in the reference
image to all features that lie within a predefined range along the corresponding
epipolar line, in the second or comparison image 402. In this particular
implementation, a feature is defined as an 8 x 8 pixel block of the image 401 or
402, where the block is expected to contain a part of an object in the scene,
represented as a pattern of pixel intensities (which, due to the filtering by the
image filter 503, may not directly represent luminance) within the block. The
likelihood that each pair of features matches is recorded and indexed by the
disparity. Blocks within the reference image 401 are eliminated by a feature pair
filter 507 if the best feature pair's likelihood of a match is weak (as compared
to a predefined threshold), or if multiple feature pairs have similar likelihood
of being the best match (where features are considered similar if the difference
in their likelihood is within a predefined threshold). Of remaining reference
features, the likelihood of all feature pairs is adjusted by a neighborhood
support process 508 by an amount proportional to the likelihood found for
neighboring reference features with feature pairs of similar disparity. For each
reference feature, the feature pair with the best likelihood may now be selected
by a feature pair selection process 509, providing a disparity (and hence,
distance) for each reference feature.
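The comparison and filtering steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the described implementation: the sum of absolute differences (SAD) stands in for the unspecified likelihood measure (a lower cost means a more likely match), the names `sad`, `match_feature`, `max_cost`, and `min_margin` are hypothetical, and the neighborhood support adjustment of process 508 is omitted.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def match_feature(ref_block, candidates, max_cost, min_margin):
    """Return the disparity of the best candidate, or None if rejected.

    candidates maps disparity -> comparison-image block found at that
    disparity along the corresponding epipolar line.
    """
    costs = sorted((sad(ref_block, blk), d) for d, blk in candidates.items())
    if not costs:
        return None
    best_cost, best_disparity = costs[0]
    if best_cost > max_cost:
        return None  # best match is too weak
    if len(costs) > 1 and costs[1][0] - best_cost < min_margin:
        return None  # runner-up is too similar, so the match is ambiguous
    return best_disparity


if __name__ == "__main__":
    ref = [10, 20, 30, 40]  # a tiny stand-in for an 8 x 8 block
    cands = {0: [90, 90, 90, 90], 1: [10, 20, 30, 41], 2: [50, 50, 50, 50]}
    print(match_feature(ref, cands, max_cost=20, min_margin=10))
```

Rejecting ambiguous features outright, rather than picking one arbitrarily, trades feature density for fewer wrong distances.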

Due to occlusion, a reference feature (produced by process 504) may not be
represented in the second or comparison image 402 and the most likely matching
feature that is present will be erroneous. Therefore, in a two camera system, the
features selected in comparison image 402 are examined by a similar procedure (by
applying processes 506, 507, 508, and 509 in a second parallel matching process
510) to determine the best matching features of those in reference image 401, a
reversal of the previous roles for images 401 and 402. In a three camera system
(i.e., a third camera is used in addition to cameras 301 and 302), the third
camera's image replaces the comparison image 402, and the original reference image
401 continues to be used as the reference image, by a similar procedure (by
applying processes 506, 507, 508, and 509 in the second parallel matching process
510) to determine the best matching features of those in the third image. If more
than three cameras are available, this process can be repeated for each of the
additional camera images. Any reference feature whose best matching paired feature
has a more likely matching feature in the reference image 401 is eliminated in a
comparison process 511. As a result, many erroneous matches, and therefore
erroneous distances, caused by occlusion are eliminated.
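For the two camera case, the elimination step amounts to a bidirectional consistency check. The sketch below assumes the two matching passes have already produced dictionaries of best matches; the function name `cross_check` and the feature identifiers are hypothetical.

```python
def cross_check(ref_to_cmp, cmp_to_ref):
    """Keep reference features whose match agrees in both directions.

    ref_to_cmp: reference feature id -> best comparison feature id
    cmp_to_ref: comparison feature id -> best reference feature id
    """
    return {
        ref_id: cmp_id
        for ref_id, cmp_id in ref_to_cmp.items()
        if cmp_to_ref.get(cmp_id) == ref_id
    }


if __name__ == "__main__":
    forward = {"r1": "c1", "r2": "c1", "r3": "c3"}
    backward = {"c1": "r1", "c3": "r3"}
    # r2 is dropped because c1 prefers r1 over r2.
    print(cross_check(forward, backward))
```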

The result of the above procedure is a depth description map 512 that describes
the position and disparity of features relative to the images 401, 402. These
positions and disparities (measured in pixels) are transformed by a coordinate
system transformation process 513 to the arbitrary three-dimensional world
coordinate system (x, y, z coordinate system) (106 of Fig. 1) by applying Eq. 1,
Eq. 2 and Eq. 3, which are presented below. Disparity can be difficult to work
with because it is non-linearly related to distance. For this reason, these
equations generally are applied at this time so that the coordinates of the scene
description 203 are described in terms of linear distance relative to the world
coordinate system 106. Application of these equations, however, will re-distribute
the coordinates of the features in such a way that the density of features in a
region will be affected, which makes the process of clustering features (performed
in a later step) more difficult. Therefore, the original image-based coordinates
typically are maintained along with the transformed coordinates.

This transformed depth description map produced by transformation process 513 is
the scene description 203 (of Fig. 2). It is the task of the scene analysis
process 204 to make sense of this information and extract useful data. Typically,
the scene analysis process 204 is dependent on the particular scenario in which
this system is applied.

Fig. 6 presents a flow diagram that summarizes an implementation of the scene
analysis process 204. In the scene analysis process 204, features within the scene
description 203 are filtered by a feature cropping module 601 to exclude features
with positions that indicate that the features are unlikely to belong to the user
or are outside the region of interest 102. Module 601 also eliminates the
background and other "distractions" (for example, another person standing behind
the user).

Typically, the region of interest 102 is defined as a bounding box aligned to the
world-coordinate system 106. When this is the case, module 601 may easily check
whether the coordinates of each feature are within the bounds of the bounding box.
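With an axis-aligned bounding box, the check reduces to a per-axis range comparison. A minimal sketch, assuming features are (x, y, z) tuples in world coordinates and the box is given by hypothetical `lower` and `upper` corners:

```python
def crop_to_region(features, lower, upper):
    """Keep features whose (x, y, z) lies inside the axis-aligned box."""
    return [
        p for p in features
        if all(lo <= c <= hi for c, lo, hi in zip(p, lower, upper))
    ]


if __name__ == "__main__":
    feats = [(0, 0, 1), (2, 0, 1), (0.5, 0.5, 0.5)]
    print(crop_to_region(feats, lower=(-1, -1, 0), upper=(1, 1, 2)))
```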

Often, parts of the background can be detected to be within the region of interest
102, or a box-shaped region of interest may be incapable of definitively
separating the user 101 from the background (particularly in confined spaces).
When it is known that no user is within the region of interest 102, the scene
description 203 is optionally sampled and modified by a background sampling module
602 to produce a background reference 603. The background reference 603 is a
description of the shape of the scene that is invariant to changes in the
appearance of the scene (for example, changes in illumination). Therefore, it is
typically sufficient to sample the scene only when the system 100 is set up, and
that reference will remain valid as long as the structure of the scene remains
unchanged. The position of a feature forming part of the scene may vary by a small
amount over time, typically due to signal noise. To assure that the observed
background remains within the shape defined by the background reference 603, the
background sampling module 602 may observe the scene description 203 for a short
period of time (typically 1 to 5 seconds), and record the features nearest to the
cameras 103 for all locations. Furthermore, the value defined by those features is
expanded further by a predetermined distance (typically the distance corresponding
to a one pixel change in disparity at the features' distances). Once sampled, this
background reference 603 can be compared to scene descriptions 203, and any
features within the scene description 203 that are on or behind the background
reference are removed by the feature cropping module 601.
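The sampling and cropping steps can be sketched as below. The representation is an assumption made for illustration: each frame is a dictionary mapping an image location to a distance from the cameras, and a single fixed `margin` replaces the per-feature distance that corresponds to a one pixel change in disparity. The function names are hypothetical.

```python
def build_background_reference(frames, margin):
    """Record the nearest distance seen at each location, then expand it.

    frames: iterable of dicts mapping image location -> distance from camera.
    """
    reference = {}
    for frame in frames:
        for loc, dist in frame.items():
            if loc not in reference or dist < reference[loc]:
                reference[loc] = dist
    # Move the reference toward the camera so noisy background samples
    # still fall on or behind it.
    return {loc: dist - margin for loc, dist in reference.items()}


def remove_background(frame, reference):
    """Drop features that lie on or behind the background reference."""
    return {
        loc: dist for loc, dist in frame.items()
        if loc not in reference or dist < reference[loc]
    }


if __name__ == "__main__":
    samples = [{(0, 0): 3.0, (1, 0): 2.5}, {(0, 0): 2.9}]
    ref = build_background_reference(samples, margin=0.1)
    live = {(0, 0): 1.5, (1, 0): 2.45, (2, 0): 4.0}
    print(remove_background(live, ref))
```

Locations with no background sample are kept here; an implementation could equally choose to treat them as unknown and discard them.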

After feature cropping, the next step is to cluster the remaining features into
collections of one or more features by way of a feature clustering process 604.
Each feature is compared to its neighbors within a predefined range. Features tend
to be distributed more evenly in their image coordinates than in their transformed
coordinates, so the neighbor distance typically is measured using the image
coordinates. The maximum acceptable range is pre-defined, and is dependent on the
particular stereo analysis process, such as stereo analysis process 202, that is
used. The stereo analysis process 202 described above produces relatively dense
and evenly distributed features, and therefore its use leads to easier clustering
than if some other stereo processing techniques are used. Of those feature pairs
that meet the criteria to be considered neighbors, their nearness in the axis most
dependent on disparity (the z-axis in those scenarios where the cameras are
positioned in front of the region of interest, or the y-axis in those scenarios
where the cameras are positioned above the region of interest) is checked against
a predefined range. A cluster may include pairs of features that do not meet these
criteria if there exists some path through the cluster of features such that the
pairs of features along this path meet the criteria.
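The path condition makes each cluster a connected component of a neighbor graph. The sketch below is one way to compute such components with a breadth-first search; the feature layout (u, v, depth), the thresholds, and the function name are assumptions for illustration, and the pairwise search is quadratic rather than optimized.

```python
from collections import deque


def cluster_features(features, max_image_dist, max_depth_gap):
    """Group features into connected components of the neighbor graph.

    features: list of (u, v, depth) tuples; u, v are image coordinates and
    depth is the coordinate on the axis most dependent on disparity.
    Returns a sorted list of clusters, each a sorted list of feature indices.
    """
    def are_neighbors(a, b):
        dist_sq = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        return dist_sq <= max_image_dist ** 2 and abs(a[2] - b[2]) <= max_depth_gap

    unvisited = set(range(len(features)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, members = deque([seed]), [seed]
        while queue:
            i = queue.popleft()
            linked = [j for j in unvisited if are_neighbors(features[i], features[j])]
            for j in linked:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return sorted(clusters)


if __name__ == "__main__":
    feats = [(0, 0, 1.0), (1, 0, 1.05), (2, 0, 1.1), (10, 10, 1.0), (1, 1, 3.0)]
    print(cluster_features(feats, max_image_dist=1.5, max_depth_gap=0.1))
```

In the example, features 0 and 2 are too far apart to be direct neighbors but share a cluster through feature 1, while feature 4 is close in the image yet excluded by its depth.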

Continuing with this implementation, clusters are filtered using a cluster
filtering process 605 to assure that the cluster has qualities consistent with
objects expected to be present within the region of interest 102, and are not the
result of features whose position (or disparity) has been erroneously identified
in the stereo processing routine. Also, as part of the cluster filtering process
605, clusters that contain too few features to provide a confident measure of
their size, shape, or position are eliminated. Measurements of the cluster's area,
bounding size, and count of features are made and compared to predefined
thresholds that describe minimum quantities of these measures. Clusters, and their
features, that do not pass these criteria are removed from further consideration.

The presence or absence of a person is determined by a presence detection module
606 in this implementation. The presence detection module 606 is optional because
the information that this component provides is not required by all systems. In
its simplest form, the presence detection module 606 need only check for the
presence of features (not previously eliminated) within the bounds of a predefined
presence detection region 607. The presence detection region 607 is any region
that is likely to be occupied in part by some part of the user 101, and is not
likely to be occupied by any object when the user is not present. The presence
detection region 607 is typically coincident to the region of interest 102. In
specific installations of this system, however, the presence detection region 607
may be defined to avoid stationary objects within the scene. In implementations
where this component is applied, further processing may be skipped if no user 101
is found.

In the described implementation of system 100, a hand detection region 105 is
defined. The method by which this region 105 is defined (by process 609) is
dependent on the scenario in which the system is applied, and is discussed in
greater detail below. That procedure may optionally analyze the user's body and
return additional information including body position(s)/measure(s) information
610, such as the position of the person's head.

The hand detection region 105 is expected to contain nothing or only the person's
hand(s) or suitable pointer. Any clusters that have not been previously removed by
filtering and that have features within the hand detection region 105 are
considered to be, or include, hands or pointers. A position is calculated (by
process 611) for each of these clusters, and if that position is within the hand
detection region 105, it is recorded (in memory) as hand position coordinates 612.
Typically, the position is measured as a weighted mean. The cluster's feature
(identified by 1005 of the example presented in Fig. 10) that is furthest from the
side of entry (1002 in that example) of the hand detection region 105 is
identified, and its position is given a weight of 1 based on the assumption that
it is likely to represent the tip of a finger or pointer. The remaining weights of
cluster features are based on the distance back from this feature, using the
formula provided below. If only one hand position is required by the application
and multiple clusters have features within the hand detection region 105, the
position that is furthest from the side of entry 1002 is provided as the hand
position 612 and other positions are discarded. Therefore, the hand that reaches
furthest into the hand detection region 105 is used. Otherwise, if more than two
clusters have features within the hand detection region 105, the position that is
furthest from the side of entry 1002 and the position that is second furthest from
the side of entry 1002 are provided as the hand positions 612 and other positions
are discarded. Whenever these rules cause a cluster to be included in place of a
different cluster, the included clusters are tagged as such in the hand position
data 612.

In those scenarios where the orientation of the cameras is such that the arm is
detectable, the orientation is represented as hand orientation coordinates 613 of
the arm or pointer, and may optionally be calculated by a hand orientation
calculation module 614. This is the case if the elevation of the cameras 103 is
sufficiently high relative to the hand detection region 105, including those
scenarios where the cameras 103 are directly above the hand detection region 105.
The orientation may be represented by the principal axis of the cluster, which is
calculated from the moments of the cluster.

An alternative method that also yields good results, in particular when the
features are not evenly distributed, is as follows. The position where the arm
enters the hand detection region 105 is found as the position where the cluster is
dissected by the plane formed by that boundary of the hand detection region 105.
The vector between that position and the hand position coordinates 612 provides
the hand orientation coordinates 613.
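The alternative orientation method can be sketched as a vector between two points. Normalizing the vector to unit length is an assumption added here for convenience and is not stated in the text; the function name and the example coordinates are hypothetical.

```python
import math


def hand_orientation(entry_point, hand_point):
    """Unit vector from the arm's entry position to the hand position.

    Returns None when the two points coincide and no direction is defined.
    """
    delta = [h - e for e, h in zip(entry_point, hand_point)]
    length = math.sqrt(sum(c * c for c in delta))
    if length == 0:
        return None
    return tuple(c / length for c in delta)


if __name__ == "__main__":
    # Arm crosses the region boundary at the origin; hand is at (3, 0, 4).
    print(hand_orientation((0.0, 0.0, 0.0), (3.0, 0.0, 4.0)))
```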

A dynamic smoothing process 615 may optionally be applied to the hand position
coordinate(s) 612, the hand orientation(s) coordinates 613 (if solved), and any
additional body positions or measures 610. Smoothing is a process of combining the
results with those solved previously so that motion is steady from frame to frame.
In one particular method of smoothing for these particular coordinate values, each
of the components of the coordinate, that is x, y, and z, are smoothed
independently and dynamically. The degree of dampening S is calculated by Eq. 5,
which is provided below, where S is dynamically and automatically adjusted in
response to the change in position. Two distance thresholds, Da and Db, as shown
in Fig. 7, define three ranges of motion. For a change in position that is less
than Da, motion is heavily dampened in region 701 by Sa, thereby reducing the
tendency of a value to switch back and forth between two nearby values (a side
effect of the discrete sampling of the images). A change in position greater than
Db is lightly dampened in region 702 by Sb, or not dampened. This reduces or
eliminates lag and vagueness that is introduced in some other smoothing
procedures. The degree of dampening is varied for motion between Da and Db, the
region marked as 703, so that the transition between light and heavy dampening is
less noticeable. Eq. 6, which is provided below, is used to solve the scalar a,
which is used in Eq. 7 (also provided below) to modify the coordinate(s). The
result of dynamic smoothing process 615 is the hand/object position information
205 of Fig. 2. Smoothing is not applied when process 611 has tagged the position
as belonging to a different cluster than the previous position, since the current
and previous positions are independent.
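The three ranges of motion can be illustrated with a sketch of one coordinate component. The formulas here are assumptions and are not Eq. 5 to Eq. 7, which are not reproduced in this passage: the dampening is interpolated linearly between the two thresholds, and it is applied as the fraction of the frame-to-frame change that is held back. All names and default values are hypothetical.

```python
def dampening(change, d_a, d_b, s_a, s_b):
    """Piecewise dampening: heavy below d_a, light above d_b, blended between."""
    if change <= d_a:
        return s_a
    if change >= d_b:
        return s_b
    t = (change - d_a) / (d_b - d_a)
    return s_a + t * (s_b - s_a)


def smooth(previous, raw, d_a=0.01, d_b=0.05, s_a=0.9, s_b=0.0):
    """Smooth one coordinate component (x, y, or z) against its previous value.

    s is treated as the fraction of the change that is held back, so s = 0
    passes the raw value through and s close to 1 nearly freezes the output.
    """
    s = dampening(abs(raw - previous), d_a, d_b, s_a, s_b)
    return previous + (1.0 - s) * (raw - previous)


if __name__ == "__main__":
    print(smooth(1.0, 1.005))  # small jitter, heavily dampened
    print(smooth(1.0, 1.2))    # deliberate motion, passed through
```

Under these assumptions, jitter smaller than the lower threshold barely moves the output, while a large motion is followed immediately, which matches the stated goal of steadiness without lag.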

The described method by which the hand detection region 105 is determined at
step 609 is dependent on the scenario in which the image control system 100 is
applied. Two scenarios are discussed here.

The simplest hand detection region 105 is a predetermined fixed region that is
expected to contain either nothing or only the person's hand(s) or pointer. One
scenario where this definition applies is the use of system 100 for controlling
the user interface of a personal computer, where the hand detection region 105 is
a region in front of the computer's display monitor 108, and above the computer's
keyboard 802, as depicted in Fig. 8. In the traditional use of the computer, the
user's hands or other objects do not normally enter this region. Therefore, any
object found to be moving within the hand detection region 105 may be interpreted
as an effort by the user 101 to perform the action of "pointing", using his or her
hand or a pointer, where a pointer may be any object suitable for performing the
act of pointing, including, for example, a pencil or other suitable pointing
device. It should be noted that specific implementation of the stereo analysis
process 202 may impose constraints on the types or appearance of objects used as
pointers. Additionally, the optional presence detection region, discussed above,
may be defined as region 801, to include, in this scenario, the user's head. The
image detector 103 may be placed above the monitor 108.

In some scenarios, the hand detection region 105 may be dynamically defined
relative to the user's body and expected to contain either nothing or only the
person's hand(s) or pointer. The use of a dynamic region removes the restriction
that the user be positioned at a predetermined position. Fig. 1 depicts a scenario
in which this implementation may be employed.

Fig. 9 shows an implementation of the optional dynamic hand detection region
positioning process 609 in greater detail. In this process, the position of the
hand detection region 105 on each of three axes is solved, while the size and
orientation of the hand detection region 105 are dictated by predefined
specifications. Figs. 10A-10C present an example that is used to help illustrate
this process.

Using the cluster data 901 (the output of the cluster filtering process 605 of
Fig. 6), the described procedure involves finding, in block 902, the position of a
plane 1001 (such as a torso-divisioning plane illustrated in the side view
depicted in Fig. 10C) whose orientation is parallel to the boundary 1002 of the
hand detection region 105 through which the user 101 is expected to reach. If the
features are expected to be evenly distributed over the original images (as is the
case when the implementation of the stereo analysis process 202 described above is
used), then it is expected that the majority of the remaining features will belong
to the user's torso, and not his hand. In this case, the plane 1001 may be
positioned so that it segments the features into two groups of equal count. If the
features are expected to be unevenly distributed (as is the case when some
alternative implementations of the stereo analysis process 202 are used), then the
above assumption may not be true. However, the majority of features that form the
outer bounds of the cluster are still expected to belong to the torso. In this
case, the plane 1001 may be positioned so that it segments the outer-most features
into two groups of equal count. In either case, the plane 1001 will be positioned
by the torso-divisioning process in block 902 so that it is likely to pass through
the user's torso.
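For evenly distributed features, a plane that splits the features into two groups of equal count sits at the median coordinate along the axis normal to the plane. The sketch below assumes that axis is z and that "in front of" the user means toward smaller z; both are conventions chosen for illustration, as are the function names and the 0.4 offset.

```python
from statistics import median


def torso_plane_position(cluster, axis=2):
    """Coordinate along `axis` that splits the features into equal counts."""
    return median(p[axis] for p in cluster)


def hand_region_position(plane_position, offset, direction=-1.0):
    """Place the hand detection region a fixed offset in front of the plane.

    direction = -1.0 assumes the user faces toward decreasing coordinates.
    """
    return plane_position + direction * offset


if __name__ == "__main__":
    # Mostly torso features near z = 2.0 to 2.2, plus one hand feature at 1.5.
    cluster = [(0, 0, 2.0), (0, 0, 2.1), (0, 0, 2.2), (0, 0, 1.5), (0, 0, 2.05)]
    plane = torso_plane_position(cluster)
    print(plane, hand_region_position(plane, offset=0.4))
```

Because the median ignores how far the outlying hand feature reaches, an extended arm shifts the plane far less than it would shift a mean.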

Process block 903 determines the position of the hand detection region 105 along
the axis that is defined normal to plane 1001 found above. The hand detection
region 105 is defined to be a predetermined distance 1004 in front of plane 1001,
and therefore in front of the user's body. In the case of Fig. 1, distance 1004
determines the position of the hand detection region 105 along the z-axis.

If the user's head is entirely within the region of interest 102, then the topmost
feature of the cluster is expected to represent the top of the user's head (and
therefore to imply the user's height), and is found in process block 904 of this
implementation. In process block 905, the hand detection region 105 is positioned
based on this head position, a predefined distance below the top of the user's
head. In the case of Fig. 1, the predefined distance determines the position of
the hand detection region 105 along the y-axis. If the user's height cannot be
measured, or if the cluster reaches the border of the region of interest 102
(implying that the person extends beyond the region of interest 102), then the
hand detection region 105 is placed at a predefined height.

In many scenarios, it can be determined whether the user's left or right arm is
associated with each hand that is detected in the position calculation block 611
of Fig. 6. In process block 906, the position where the arm intersects a plane
that is a predefined position in front of plane 1001 is determined. Typically,
this plane is coincident to the hand detection region boundary indicated by 1002.
If no features are near this plane, but if some features are found in front of
this plane, then it is likely that those features occlude the intersection with
that plane, and the position of intersection may be assumed to be behind the
occluding features. By shortest neighbor distances between the features of the
block, each intersection is associated with a hand.

The position of the middle of the user's body and the bounds of the user's body
are also found in process block 907. Typically, this position is, given evenly distributed
features, the mean position of all the features in the cluster. If features are not expected to
be evenly distributed, the alternative measure of the position halfway between the
cluster's bounds may be used.
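
The two body-center measures described for process block 907 can be illustrated with a
minimal sketch. The function name and the flat list of feature x coordinates are
assumptions made for illustration only and are not part of the described system.

```python
from statistics import mean

def body_center_x(feature_xs, evenly_distributed=True):
    """Horizontal body-center estimate from the x coordinates of a cluster's features."""
    if not feature_xs:
        raise ValueError("cluster has no features")
    if evenly_distributed:
        # Mean position of all the features in the cluster.
        return mean(feature_xs)
    # Alternative measure: halfway between the cluster's left and right bounds.
    return (min(feature_xs) + max(feature_xs)) / 2.0
```

The midpoint form is less sensitive to a dense patch of features on one side of the body,
which is the situation the alternative measure is intended for.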

In process block 908, the arm-dependent position found by process block 906 is
compared to the body-centric position found by process block 907. If the arm position is
sufficiently offset (e.g., by greater than a predefined position that may be scaled by the
cluster's overall width) to either the left or right of the body-center position, then it may
be implied that the source of the arm comes from the left or right shoulder of the user
101. If two hands are found but only one hand may be labeled as "left" or "right" with
certainty, the label of the other hand may be implied. Therefore, each hand is labeled as
"left" or "right" based on the cluster's structure, assuring proper labeling in many
scenarios where both hands are found and the left hand position is to the right of the right
hand position.
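
The labeling logic of process block 908 may be sketched as follows. This is an
illustrative sketch only: the function names, the default offset fraction of 0.15, and
the convention that x increases towards the user's right are all assumptions.

```python
def label_arm(arm_x, body_center_x, cluster_width, offset_fraction=0.15):
    """Return "left", "right", or None when the offset is too small to decide.

    Assumes x increases towards the user's right; swap the labels otherwise.
    """
    threshold = offset_fraction * cluster_width  # offset scaled by the cluster's width
    offset = arm_x - body_center_x
    if offset > threshold:
        return "right"
    if offset < -threshold:
        return "left"
    return None

def label_hands(arm_xs, body_center_x, cluster_width):
    """Label each arm intersection; with two hands, one certain label implies the other."""
    labels = [label_arm(x, body_center_x, cluster_width) for x in arm_xs]
    if len(labels) == 2 and labels.count(None) == 1:
        known = labels[0] or labels[1]
        other = "left" if known == "right" else "right"
        labels = [lab if lab is not None else other for lab in labels]
    return labels
```

Because the label depends on where the arm leaves the body rather than on where the hand
is, crossed hands can still be labeled correctly under these assumptions.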


If one hand is identified by process block 908, then the hand detection region 105
may be placed (by process block 909) so that all parts of the hand detection region 105
are within an expected range of motion associated with the user's hand. The position of
the hand detection region 105 along the remaining axis may be biased towards the side of
the arm as defined by Eq. 8 (which is provided below). If process block 908 failed to
identify the arm, or if it is otherwise desired, the position of the hand detection region 105
along the remaining axis may be positioned at the center of the user's body as found by
process block 907. In scenarios where tracking of both hands is desired, the hand
detection region 105 may be positioned at the center of the user's body.

Process blocks 903, 905 and 909 each solve the position of the hand detection
region 105 in one axis, and together define the position of the hand detection region 105
within three-dimensional space. That position is smoothed by a dynamic smoothing
process 910 by the same method used by component 615 (using Eq. 5, Eq. 6, and Eq. 7).
However, a higher level of dampening may be used in process 910.

The smoothed position information output from the dynamic smoothing process
910, plus predefined size and orientation information 911, completely defines the bounds
of the hand detection region 105. In solving the position of the hand detection region
105, process blocks 905, 907, and 908 find a variety of additional body position measures
913 (610 of Fig. 6) of the user.

In summary, the above implementation described by Fig. 6, using all the optional
components, including those of Fig. 9, produces a description of person(s) in the scene
(represented as the hand/object position information 205 of Fig. 2) that includes the
following information:

- Presence/absence or count of users
- For each present user:
    o Left/right bounds of the body or torso
    o Center point of the body or torso
    o Top of the head (if the head is within the region of interest)
    o For each present hand:
        ■ The hand detection region
        ■ A label of "left" or "right"
        ■ The position of the tip of the hand
        ■ The orientation of the hand or forearm

Given improvements in the resolution of the scene description 203,
implementations described here may be expanded to describe the user in greater detail
(for example, identifying elbow positions).

This hand/object position information 205, a subset of this information, or further 
information that may be implied from the above information, is sufficient to allow the 
user to interact with and/or control a variety of application programs 208. The control of 
three applications is described in greater detail below.

Through processing the above information, a variety of human gestures can be
detected that are independent of the application 208 and the specific control analogy
described below. An example of such a gesture is "drawing a circle in the air" or
"swiping the hand off to one side". Typically, these kinds of gestures are detected by the
gesture analysis and detection process 209 using the hand/object position information
205.

A large subset of these gestures may be detected using heuristic techniques. The
detection process 209 maintains a history of the hand and body positions. One approach
to detecting gestures is to check if the positions pass explicit sets of rules. For example,
the gesture of "swiping the hand off to one side" can be identified if the following gesture
detection rules are satisfied:

1. The change in horizontal position is greater than a predefined distance over a time
   span that is less than a predefined limit.

2. The horizontal position changes monotonically over that time span.

3. The change in vertical position is less than a predefined distance over that time
   span.

4. The position at the end of the time span is nearer to (or on) a border of the hand
   detection region than the position at the start of the time span.
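
The four rules above can be checked against a position history as in the following
sketch. The sample layout (time, x, y), the parameter names, and the reading of rule 3 as
a bound on the total vertical range are assumptions made for illustration; only the
left and right borders are considered for rule 4.

```python
def is_swipe(history, min_dx, max_dy, max_duration, region_x_min, region_x_max):
    """history: list of (t, x, y) samples, oldest first, covering one candidate time span."""
    if len(history) < 2:
        return False
    (t0, x0, _), (t1, x1, _) = history[0], history[-1]
    xs = [x for _, x, _ in history]
    ys = [y for _, _, y in history]
    # Rule 1: large horizontal change within a short enough time span.
    rule1 = abs(x1 - x0) > min_dx and (t1 - t0) < max_duration
    # Rule 2: the horizontal position changes monotonically.
    steps = [b - a for a, b in zip(xs, xs[1:])]
    rule2 = all(s >= 0 for s in steps) or all(s <= 0 for s in steps)
    # Rule 3: small vertical change over the span (taken here as the total range).
    rule3 = (max(ys) - min(ys)) < max_dy
    # Rule 4: the hand ends nearer to a left/right border than it started.
    def border_dist(x):
        return min(abs(x - region_x_min), abs(region_x_max - x))
    rule4 = border_dist(x1) < border_dist(x0)
    return rule1 and rule2 and rule3 and rule4
```

In use, the caller would pass the most recent window of the history maintained by the
detection process and treat a True result as one occurrence of the gesture.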

Some gestures require that multiple rule sets are satisfied in an explicit order,
whereby the satisfaction of a rule set causes the system to change to a state where a
different rule set is applied. This system may be unable to detect subtle gestures, in
which case Hidden Markov Models may be used, as these models still allow for chains of
specific motions to be detected, but also consider the overall probability that the motions
sufficiently fit a gesture.

An implementation of this system provides a method of user interaction whereby
the user causes a representation of an indicator to move within an image (user feedback
206) that is presented to the user on a video display 108. The indicator is made to move
in a way that reflects the movements of the user's hand.

In one variation of this form of user interface, the indicator is shown in front of
other graphics, and its movements are mapped to the two dimensional space defined by
the surface of the video display screen 108. This form of control is analogous to that
provided by a mouse commonly used with desktop computers. Fig. 11A shows an
example of a feedback image 206 of an application program 208 that uses this style of
control.

The following describes a method by which, in the position mapping process 207,
a hand position 205, detected by the scene analysis process 204 as previously described,
is mapped into the position where the screen pointer or cursor 1101 is overlaid onto the
screen image 206 presented on the video display 108. When one hand is detected and
found to be within the hand detection region 105, then the hand position 205 relative to
the hand detection region 105 is mapped by the position mapping process 207 into
coordinates relative to the video display 108 before it is conveyed to the application
program 208. One method of mapping the coordinates is through the application of Eq. 9
(which is shown below) for the x coordinate and the equivalent for the y coordinate. As
illustrated in Fig. 11B, the entire display region 1102 is represented by a sub-region 1103
contained entirely within the hand detection region 1104 (analogous to hand detection
region 105). Positions (for example, hand position 1105) within the sub-region 1103 are
linearly mapped to positions (for example, 1106) within the display region 1102.
Positions (for example, 1107) outside the sub-region 1103 but still within the hand
detection region 1104 are mapped to the nearest position (for example, 1108) on the
border of the display region 1102. This reduces the likelihood of the user unintentionally
removing the hand from the sub-region 1103 while attempting to move the cursor 1101 to
a position near a border of the display. If both of the user's hands are detected within the
hand detection region 105, then one hand is selected in position mapping process 207.
Typically, the hand that is reaching furthest into the hand detection region 105 is selected.
That hand is detectable as the hand that has, depending on the configuration of the
system and the definition of the world coordinate system 106, either the largest or
smallest x, y, or z coordinate value.
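
The linear mapping with border clamping described above may be sketched as follows. This
sketch does not reproduce Eq. 9; it is a generic linear map consistent with the
description, and the function names, the tuple layouts, and the pixel-based display size
are assumptions.

```python
def map_axis(p, sub_min, sub_max, disp_min, disp_max):
    """Linearly map p from [sub_min, sub_max] to [disp_min, disp_max], clamping outside."""
    t = (p - sub_min) / (sub_max - sub_min)
    t = min(1.0, max(0.0, t))  # positions outside the sub-region stick to the border
    return disp_min + t * (disp_max - disp_min)

def map_hand_to_cursor(hand_xy, sub_region, display_size):
    """sub_region is (x0, y0, x1, y1); display_size is (width, height) in pixels."""
    x, y = hand_xy
    sx0, sy0, sx1, sy1 = sub_region
    w, h = display_size
    return (map_axis(x, sx0, sx1, 0.0, w - 1), map_axis(y, sy0, sy1, 0.0, h - 1))
```

Clamping each axis independently yields the nearest point on the border of an
axis-aligned display rectangle, which matches the behavior described for positions such
as 1107 and 1108.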

An application that uses this style of interaction typically presents graphic
representations of data or controls (for example, a button 1109). The user is expected to
cause the indicator 1101 to be positioned over one of these objects. This condition may
be detected by comparing the remapped indicator position 1106 to the bounds (for
example, 1110) of the graphic representation of the object; the condition is satisfied when
the indicator position is within the object bounds. The user optionally receives feedback
indicating that the cursor is positioned over an object. Feedback may be of a variety of
forms, including an audio cue and/or a change in the graphical representation of either or
both the cursor and the object. The user then may activate, manipulate, or move the
object that is under the cursor.

The user is expected to indicate his intention to activate, manipulate, or move the
object by performing a gesture. In the implementation of this system presented here, the
gesture analysis process 209 identifies as gestures patterns in the hand
position or other positions and measures provided by either or both of scene analysis
process 204 and position mapping process 207. For example, the user may indicate an
intention to activate the object that is under the cursor by causing the cursor to remain
over the object for longer than a predefined duration. Detection of this gesture requires
that the state 210 of the application, in particular the bounds and/or state of the object, be
fed back into the gesture analysis process 209. The application need not be created
specifically for this system, as techniques exist that can unobtrusively monitor an
application's state 210 and, using the coordinates provided by the position mapping
process 207, emulate other interface devices such as a computer mouse.

In some scenarios, the application state information 210 may not be monitored. In
this case, gestures that indicate the intention to activate the object under the cursor
include holding the hand stationary ("hovering"), or poking the hand quickly forward and
back.

A method by which "hovering" has been detected is by keeping a history of the
position of the hand, where that history contains all records of the hand position and state
for a predefined duration of time that ends with the most recent sample. That duration
represents the minimum duration that the user must hold the hand stationary. The
minimum and maximum position, separately in each of the three (x, y, z) dimensions, is
found within the history. If the hand was present in all samples of the history, and the
distance between the minimum and maximum is within a predefined threshold for each of
the three dimensions, then the "hovering" gesture is reported. Those distance thresholds
represent the maximum amount that the hand is allowed to move, plus the maximum
amount of variation (or "jitter") expected to be introduced into the hand position by the
various components of the system. The typical method in which this gesture is reported,
where the system is emulating a mouse as described above, is to emulate a mouse "click".
Gestures representing additional operations of the mouse, "double clicks" and "dragging",
have also been detected and those operations have been emulated.
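
The "hovering" test described above reduces to a per-axis range check over the history,
as in this minimal sketch. The representation of a missing hand as None and the
assumption that the caller has already trimmed the history to the predefined duration are
illustrative choices, not details taken from the described system.

```python
def is_hovering(history, thresholds):
    """history: samples covering the predefined duration, each an (x, y, z) tuple,
    or None for a sample in which no hand was detected.
    thresholds: maximum allowed (max - min) spread for each of the three axes."""
    if not history or any(sample is None for sample in history):
        return False  # the hand must be present in every sample of the history
    for axis, limit in enumerate(thresholds):
        values = [sample[axis] for sample in history]
        if max(values) - min(values) > limit:
            return False
    return True
```

Each threshold would be chosen as the allowed hand movement plus the expected jitter, as
the paragraph above explains.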

In addition, gestures that are independent of the position of the indicator relative
to an object may optionally be detected and given meaning by the application, either with
or without regard to the application's state. An application that uses this style of
interaction typically does not explicitly use or display the user's hand or other positions.
These applications can be wholly or primarily controlled with only the interpretations of
the positions made by this system. These applications also need not be created
specifically for this system because the interpretations made by this system can be used to
simulate an action that would be performed on a traditional user input device, such as a
keyboard or joystick.

Many useful interpretations depend directly on the position of the hand
within the hand detection region 105. One method of making these interpretations is to
define boxes, planes, or other shapes. A state is triggered on if the hand position is found
to be within a first box (or beyond the border defined by the first plane), and had not been
in the immediately preceding observation (either because it was elsewhere within the
hand detection region 105, or was not detected). This state is maintained until the hand
position is not found to be within a second box (or beyond the border defined by the
second plane), at which time the state is triggered off. The second box must contain the
entire first box, and, in general, is slightly larger. The use of a slightly larger box reduces
occurrences of the state unintentionally triggering on and off when the hand position is
held near the border of the boxes. Typically, one of three methods of interpreting this
state is used, depending on the intended use of the gesture. In one method, the gesture
directly reflects the state with an on and off trigger. When emulating a keyboard key or
joystick fire button, the button is "pressed" when the state is triggered on, and "released"
when the state is triggered off. In another common method, the gesture is only triggered
by the transition of the state from off to on. When emulating a keyboard key or joystick
button, the key is "clicked". Although the duration and off state are not reported to the
application, they are maintained so that the gesture will not be repeated until after the
state is triggered off, so that each instance of the gesture requires a clearly defined intent
by the user. The third method typically employed is to trigger the gesture by the
transition of the state from off to on, and to periodically re-trigger the gesture at
predefined intervals so long as the state remains on. This emulates the way in which
holding a key down on a keyboard causes the character to repeat in some applications.
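
The two-box hysteresis described above may be sketched as a small state holder. The
class name, the representation of a box as per-axis (low, high) pairs, and the returned
event strings are assumptions made for illustration.

```python
def inside(box, p):
    """box: sequence of (lo, hi) pairs, one per axis; p: point with the same number of axes."""
    return all(lo <= v <= hi for (lo, hi), v in zip(box, p))

class BoxTrigger:
    def __init__(self, first_box, second_box):
        self.first_box = first_box    # entering this box switches the state on
        self.second_box = second_box  # leaving this larger box switches the state off
        self.on = False

    def update(self, p):
        """p is the hand position, or None when no hand is detected.
        Returns "on" or "off" when the state changes, otherwise None."""
        if not self.on and p is not None and inside(self.first_box, p):
            self.on = True
            return "on"
        if self.on and (p is None or not inside(self.second_box, p)):
            self.on = False
            return "off"
        return None
```

Under this sketch, the "pressed"/"released" interpretation forwards both events, the
"clicked" interpretation forwards only "on", and the repeating interpretation would add a
timer that re-issues the gesture while `on` remains true.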

One way in which boxes or planes for the above techniques may be defined within
the hand detection region 105 is as follows. By defining a first plane (1501 in Fig. 15A)
and second plane 1502 that divide the hand detection region 105 into "fire" 1503 and
"neutral" 1504 regions (the gesture reported when the hand is in the region 1505 between
the planes depends on the previous positions of the hand, as described above), the above
technique can detect a hand "jabbing" forward, which is one gesture for emulating a fire
button on a joystick, or causing the application to respond in a way that is commonly
associated with the pressing of a joystick button (for example, the firing of a weapon in a
video game).

Another way in which boxes or planes for the above techniques may be defined
within the hand detection region 105 is as follows. Planes of the first type 1506, 1507,
1508, 1509 are defined that separate each of the left, right, top and bottom portions of the
hand detection region 105, overlapping in the corner regions as illustrated in Fig. 15B.
Planes of the second type are labeled as 1510, 1511, 1512, 1513. Each pair of first and
second planes is processed independently. This combination of planes emulates the four
directional cursor keys, where a hand in a corner triggers two keys, commonly interpreted
by many applications as the four secondary 45 degree (diagonal) directions.

Referring to Fig. 15C, an alternative method is shown for emulating control of
discrete directions and applies for applications that expect the four 45 degree direction
states to be explicitly represented. Boxes 1514, 1515, 1516, 1517 are defined for each of
the four primary (horizontal and vertical) directions, with additional boxes 1518, 1519,
1520, 1521 defined for each of the secondary 45 degree (diagonal) directions. For clarity,
only boxes of the first type are illustrated. A gap is placed between these boxes. Fig.
15D illustrates how neighboring boxes are defined. The gap between boxes of the first
type 1522, 1523 assures that the user intentionally enters the box, while the gap 1524 is
filled by overlapping boxes of the second type 1525, 1526, so that the system will report
the previous gesture until the user clearly intends to move into the neighboring box.
This combination of buttons can be used to emulate an eight-directional joystick pad.

A wider class of gestures depend on motion instead of or in addition to position.
An example is the gesture of "swiping the hand to the left". This gesture may be used to
convey to an application that it is to return to a previous page or state. Through emulation
of a keyboard and mouse, this gesture causes presentation software, such as
PowerPoint, to go to the previous slide of a presentation sequence. Through emulation of
a keyboard and mouse, this gesture causes a web browser to perform the action associated
with its "back" button. Similarly, the gesture of "swiping the hand to the right" is a
gesture that may be used to convey to an application that the user desires to go to the next
page or state. For example, this gesture causes presentation software to go to the next
slide of a presentation sequence, and causes browser software to go to the next page.

Using the method of dividing the hand detection region 105 into regions by
separated planes, a method for detecting the "swiping the hand to the left" gesture that is
simpler than that presented earlier is as follows. A thin stripe along the leftmost part of
the hand detection region 105 is defined as the left-edge region. The hand position is
represented as the following three states:

1. The hand is present and not inside the left-edge region

2. The hand is present and inside the left-edge region

3. The hand is not present within the hand detection region

A transition from state 1 to state 2 above causes the gesture detection process 209
to enter a state whereby it starts a timer and waits for the next transition. If a transition to
state 3 is observed within a predetermined duration of time, the "swiping the hand off to
the left" gesture is reported to have occurred. This technique is typically duplicated for
the right, upper, and lower edges, and, because the hand position is found in three
dimensions, also duplicated to detect "pulling the hand back". All of the above gestures
may be detected using the position of either the head or torso instead of the hand.
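
The three-state edge technique can be sketched as a small state machine. The constant
and class names are hypothetical, and cancelling the timer when the hand returns from the
edge region to state 1 is an assumption about how "the next transition" is handled.

```python
AWAY, IN_EDGE, ABSENT = 1, 2, 3  # the three hand states listed above

class EdgeSwipeDetector:
    def __init__(self, max_delay):
        self.max_delay = max_delay  # allowed time between entering the edge and leaving
        self.prev = ABSENT
        self.armed_at = None        # time of the pending 1 -> 2 transition, if any

    def update(self, state, t):
        """Feed the current hand state and timestamp; returns True when a swipe is reported."""
        swipe = False
        if self.prev == AWAY and state == IN_EDGE:
            self.armed_at = t                      # start the timer
        elif state == ABSENT and self.armed_at is not None:
            swipe = (t - self.armed_at) <= self.max_delay
            self.armed_at = None
        elif state == AWAY:
            self.armed_at = None                   # hand moved back; cancel (assumption)
        self.prev = state
        return swipe
```

One detector instance per edge (left, right, upper, lower, and the near face for
"pulling the hand back") would duplicate the technique as described.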

In another variation of this system, the user causes a representation of an
indicator, or two indicators (one for each hand), to move within a representation (user
feedback 206) of a three-dimensional virtual environment. The feedback may be
provided by stereoscopic means whereby each of the user's eyes view a unique image
creating an illusion of depth, although this type of system is impractical in many
scenarios, and is therefore optional. It is otherwise possible, however, to imply the depth
of objects by rendering the virtual environment using projective transforms. An example
of use of this type of rendering is provided in Figs. 12A, 12B, and 13A.

Referring to Fig. 12A, the following describes a method by which, in the position
mapping process 207, a hand position 205, detected by the scene analysis process 204 as
previously described, is mapped into the position where the indicator 1201 is positioned
within the virtual environment. Hand position(s) 205 relative to the hand detection region
105 are mapped by the position mapping process 207 into coordinates relative to the
video display 108 before being conveyed to the application program 208. One method of
mapping the coordinates is through the application of Eq. 9 for the x coordinate and the
equivalent for the y and z coordinates. This is similar to the method described previously,
except that a third dimension has been added.

Given the ability of the user to manipulate the position of the indicator 1201 in all
three dimensions, the user 101 may cause the indicator(s) to touch objects (for example,
1202) within the virtual environment like he would in the real world. This is one
method of user interaction with a virtual environment. The bounds (for example, 1203
and 1204), which may be represented as a cube or sphere, of the indicator and object are
compared. The condition where the two bounds intersect indicates that the indicator is
touching the object. It is possible, given well laid out objects, for the user to cause the
indicator to move to a position that "touches" an object, where the path of the indicator
avoids "touching" any other objects. Therefore, a "touch" generally signals the user's
intention to activate, manipulate, or move the object. Therefore, unlike two-dimensional
control, three-dimensional control of the indicator 1201 eliminates the need for an explicit
gesture to initiate one of these actions. Also, unlike two-dimensional control, objects may
be laid out at different depths (as are the file folders in Fig. 12A), to provide an interface
that is a closer analogy to actions that the user may be familiar with performing in the real
world. In addition, gestures that are independent of the position of the indicator 1201
relative to an object may optionally be detected to indicate the intention to perform an
action.
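
The "touch" condition, in which the bounds of the indicator and the object intersect,
may be sketched for both sphere and axis-aligned box bounds. The function names and
argument layouts are assumptions made for illustration.

```python
import math

def spheres_touch(center_a, radius_a, center_b, radius_b):
    """True when two spherical bounds intersect or are in contact."""
    return math.dist(center_a, center_b) <= radius_a + radius_b

def boxes_touch(min_a, max_a, min_b, max_b):
    """Axis-aligned boxes given by their minimum and maximum corners."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for lo_a, hi_a, lo_b, hi_b in zip(min_a, max_a, min_b, max_b))
```

Either test would be evaluated for the indicator against each touchable object on every
update, with the first intersecting object taken as the one being touched.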

It is possible for the user to navigate within a virtual environment using this
system. Navigation allows the user access to more objects or information than may be
represented in the user feedback 206 at one time, by allowing the user to cause the
selection of a subset of the objects or information to be represented. Navigation may
optionally be of a form whereby the user 101 roams within a virtual environment and the
subset of objects or information available to the user is dependent on the user's position
within the virtual environment. An example is presented in Fig. 13A, where the user may
roam within the virtual room to reach any of several collections of objects
represented as filing cabinets.

Next, a method by which the user roams within a virtual environment is described.
The video display image 206 is rendered in such a way that it represents the virtual
environment as viewed by a virtual camera, whereby any objects within the field of view
of the virtual camera, and not occluded by other virtual objects, are presented to the user.
In one option, referred to as "first person", the position of the camera represents the
position of the user within the virtual environment. In another option, an indicator
represents the position of the user within the virtual environment. This indicator may
optionally be an avatar (presented on the video display 108) that represents the user 101.
The virtual camera position is caused to follow the indicator so that the indicator and all
objects accessible to the user from the current user position are within the virtual
camera's field of view.

Either the user's hand, body or head position may affect the user's virtual position
when roaming. A position representing the center of the user's torso or the top of his
head is found in some implementations of this system, in particular those
implementations in which the optional gesture analysis process 609 is performed in its
entirety as outlined by Fig. 9. The use of either of these positions allows the user 101 to
perform the action of roaming independently of the position of his hands, permitting the
hands to be used to "touch" virtual objects while roaming. Note that these touchable
objects may be fixed in position relative to the virtual environment, or fixed in position
relative to the virtual camera and therefore always available to the user. If these positions
are not available, or it is otherwise desired, the user's hand position may be used to
control roaming. In this case, the system may automatically switch to the touch context
when the user has roamed near touchable objects, or has performed a particular
gesture.

To provide a region where no change to the virtual position is implied, called a
dead zone, the position (either hand, torso, or head) may be remapped by application of
Eq. 10 (and similar equations for the y and z coordinates), which results in the
relationship illustrated by the graphs in Fig. 13B. Note that the bounds and neutral
position may be coincident to the hand detection region 105 and its center, or another
region that is dynamically adjusted to accommodate the user.
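
A dead-zone remapping of the general kind described above may be sketched as follows.
This sketch does not reproduce Eq. 10; it assumes a generic piecewise-linear form that is
zero inside the dead zone and rises linearly to plus or minus one at the bounds. The
function and parameter names are hypothetical.

```python
def dead_zone_remap(p, neutral, bound_lo, bound_hi, dead_half_width):
    """Map p to [-1, 1]: 0 inside the dead zone, rising linearly to +/-1 at the bounds.

    Assumes bound_lo < neutral - dead_half_width and neutral + dead_half_width < bound_hi.
    """
    offset = p - neutral
    if abs(offset) <= dead_half_width:
        return 0.0
    if offset > 0:
        span = (bound_hi - neutral) - dead_half_width
        return min(1.0, (offset - dead_half_width) / span)
    span = (neutral - bound_lo) - dead_half_width
    return max(-1.0, (offset + dead_half_width) / span)
```

The same function would be applied separately to each of the x, y, and z coordinates,
each with its own neutral position and bounds.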

When the torso or head is used, the bounds and neutral position, as used in Eq. 10,
may be adjusted to accommodate the user as follows. First, the neutral position x_c, y_c, z_c
used in Eq. 10 may correspond to the neutral position of the user's body. All users, after
approaching the system, may not stand in the exact same location. After the user 101 has
been given time to enter the region of interest 102, the user's torso or head position is
sampled and used as the neutral position. The maximum range of motion, that is the
distance in which a user is expected to comfortably move (or "lean") in each axis, is
predefined. To assure that the user remains within the region of interest 102 while
moving to these extreme positions, the neutral position x_c is bounded to within the region
of interest 102 by a minimum of one half of the maximum range of motion described
above, plus one half the typical body size, in each of the x, y, and z dimensions. The
bounds b_l and b_r are placed relative to the neutral position, with each being one half the
maximum range of motion from the neutral position.

Gestures, as discussed earlier, may be based on the position and/or motion of the
head or torso instead of the hand. In this case, the region defined by these bounds is used
instead of the hand detection region 105.

Horizontal motions of the user (along the axis labeled x in the example of Fig. 1)
cause the view of the virtual environment to look left or right. The horizontal position,
transformed by Eq. 10, is applied as a velocity function onto rotation about the virtual
vertical axis, causing the indicator and/or camera to yaw. It is optional that vertical
motions of the user (along the axis labeled y in the example of Fig. 1) cause the virtual
view to look up or down. The vertical position, transformed by Eq. 10, is interpreted
directly as the angle of rotation about the horizontal axis, causing the indicator and/or
camera to pitch. Motions of the user 101 to or from (along the axis labeled z in the
example of Fig. 1) the display cause the virtual position to move forward or backwards.
One style of motion is analogous to "walking", where the indicator and/or camera
remains a predefined height above a virtual "floor", and follows any contours of the floor
(for example, move up a set of virtual stairs). The transformed position is applied as a
velocity onto the vector that is the projection of the indicator and/or camera's orientation
onto the plane defined by the "floor". Another style of motion is analogous to "flying".
If this is desired, the transformed position is applied as a velocity onto the vector defined
by the indicator and/or camera's orientation. An example of a virtual environment, which
is navigated by the "flying" method of control as described, is shown in Fig. 14. The
user's torso position, found by the methods described earlier and using the mapping of
Eq. 10 and adaptive neutral position as described previously, is used in this example.
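
One update step of the "walking" and "flying" styles of motion may be sketched as
follows. The control inputs are assumed to be already remapped to the range -1 to 1; the
axis conventions (y up, yaw about y), the rate constants, the sign of the forward
control, and the flat floor (no contour following) are all assumptions made for
illustration.

```python
import math

def forward_vector(yaw, pitch):
    """Unit view direction for yaw about the vertical (y) axis and pitch above the floor."""
    return (math.sin(yaw) * math.cos(pitch), math.sin(pitch), math.cos(yaw) * math.cos(pitch))

def step(position, yaw, control_x, control_y, control_z, dt,
         max_yaw_rate=1.0, max_pitch=0.5, max_speed=2.0, walking=True):
    """Advance the indicator/camera by one time step; returns (position, yaw, pitch)."""
    yaw += control_x * max_yaw_rate * dt   # horizontal control acts as a yaw velocity
    pitch = control_y * max_pitch          # vertical control is used directly as the pitch angle
    fx, fy, fz = forward_vector(yaw, pitch)
    if walking:
        # Project the orientation onto the (flat) floor plane and renormalise.
        norm = math.hypot(fx, fz) or 1.0
        fx, fy, fz = fx / norm, 0.0, fz / norm
    speed = control_z * max_speed          # to/from control acts as a forward velocity
    x, y, z = position
    return (x + fx * speed * dt, y + fy * speed * dt, z + fz * speed * dt), yaw, pitch
```

With `walking=True` the height never changes, whereas with `walking=False` a non-zero
pitch carries the position up or down along the view direction.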

The indicator used in the virtual environment, whether or not the above method by
which the user controls or roams in the virtual environment is utilized, may take the form
of an avatar. An avatar typically takes the form of a human-like body, as in 1401 of Fig.
14. The positions found by this system provide sufficient information to animate the
virtual human-like form.

This system finds both of the user's hands when they are within the hand detection
region 105. These positions are remapped to corresponding positions in front of the
avatar's torso, allowing the avatar's hands to reach to the same positions as the user is
reaching to. A user's hand is not found or selected when the hand is not within the hand
detection region 105. In this case, the avatar's corresponding virtual hand may be moved
to a neutral position along that side of the avatar's body.

In implementations of this system that utilize "roaming", a control position is
found relative to a neutral position. In these implementations, the avatar's feet may remain
in fixed positions and the relative control position is used directly to determine the
position of the avatar's torso over the fixed feet (the stance). Fig. 14 shows an avatar
controlled in this manner. In implementations not using "roaming", the avatar's torso
position may be determined directly by the position representing the center of the user's
torso or alternatively a position relative to the top of the head, as found in optional
component 609.

Additional details such as the positions of secondary joints may be found using
inverse kinematics techniques. In particular, the orientation data 613 associated with the
forearm can be used to constrain the inverse kinematics solution to position the elbow to
be near to the region from which the forearm originates within the hand detection region
105. The orientation data 613 constrains the elbow to a plane. The elbow's position on
that plane is determined as the intersection of the arcs, with radii representing the length
of the avatar's upper and lower arm segments, one centered on the avatar's hand position
(in the virtual environment) and the other centered on a position relative to the avatar's
torso representing the shoulder. Similarly the avatar's knee positions may be determined
by the application program. By placing the avatar's feet in a fixed position and assuming
the avatar's ankles cannot twist, the plane in which the knee bends is also fixed, and the
knee position is determined by a similar intersection calculation as the elbow position.
Moreover, using the fixed foot position, the position of the avatar may be calculated such
that the avatar appears to lean in a desired direction. With these calculations, the
positions of the avatar's torso, hands, elbows, feet and knees are
sufficient to animate the avatar.
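
The arc-intersection step for the elbow can be sketched in two dimensions, working within the plane fixed by the orientation data. The following Python sketch is illustrative only: the function name `elbow_position`, the 2-D in-plane coordinates, the clamping of an out-of-reach hand, and the `bend_sign` parameter used to choose between the two intersection points are assumptions, not details of the described system.

```python
import math

def elbow_position(shoulder, hand, upper_len, lower_len, bend_sign=1.0):
    """Intersect two circles in the elbow plane (hypothetical helper).

    shoulder, hand: (x, y) points expressed in the plane's 2-D coordinates.
    upper_len, lower_len: lengths of the upper and lower arm segments.
    bend_sign: +1 or -1, selects which of the two intersections is used.
    """
    sx, sy = shoulder
    hx, hy = hand
    dx, dy = hx - sx, hy - sy
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        raise ValueError("shoulder and hand coincide; elbow is undetermined")
    # Clamp the reach so the two circles always touch or intersect (assumption).
    shortest = abs(upper_len - lower_len)
    longest = upper_len + lower_len
    d = min(max(dist, shortest), longest)
    # Distance from the shoulder to the foot of the elbow on the shoulder-hand line.
    a = (upper_len ** 2 - lower_len ** 2 + d ** 2) / (2.0 * d)
    # Height of the elbow above that line.
    h = math.sqrt(max(upper_len ** 2 - a ** 2, 0.0))
    ux, uy = dx / dist, dy / dist   # unit vector from shoulder towards hand
    px, py = -uy, ux                # unit vector perpendicular to it
    return (sx + a * ux + bend_sign * h * px,
            sy + a * uy + bend_sign * h * py)
```

The same routine could serve for the knee, with the hip and ankle taking the place of the shoulder and hand.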
EQUATIONS

Eq. 1

$$X = \frac{I x}{D}$$

where $I$ is the inter-camera distance
$D$ is the disparity
$x$ is the image position
$X$ is the world coordinate position

Eq. 2

$$Y = \frac{(sFI \sin\alpha) + (Iy \cos\alpha)}{D}$$

where $I$ is the inter-camera distance
$D$ is the disparity
$F$ is the average focal length
$s$ is a unit-conversion factor applied to the focal length
$\alpha$ is the angle of tilt between the cameras and the world coordinate z-axis
$y$ is the image position
$Y$ is the world coordinate position

Eq. 3

$$Z = \frac{(sFI \cos\alpha) - (Iy \sin\alpha)}{D}$$

where $I$ is the inter-camera distance
$D$ is the disparity
$F$ is the average focal length
$s$ is a unit-conversion factor applied to the focal length
$\alpha$ is the angle of tilt between the cameras and the world coordinate z-axis
$y$ is the image position
$Z$ is the world coordinate position
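
Eqs. 1 to 3 can be collected into a single conversion from an image feature to world coordinates. The sketch below is a minimal illustration, assuming a sign convention in which the camera tilt is a pure rotation about the x-axis (so the second term of Z is subtracted); the function name `feature_to_world` and its argument names are hypothetical.

```python
import math

def feature_to_world(x, y, disparity, baseline, focal_len, unit_scale, tilt):
    """Map an image feature to world coordinates (illustrative sketch).

    x, y: image position of the feature.
    disparity: disparity D of the matched feature pair (must be non-zero).
    baseline: inter-camera distance I.
    focal_len: average focal length F.
    unit_scale: unit-conversion factor s applied to the focal length.
    tilt: angle in radians between the cameras and the world z-axis.
    """
    if disparity == 0:
        raise ValueError("zero disparity: the feature is at infinity")
    depth_term = unit_scale * focal_len * baseline            # s * F * I
    world_x = baseline * x / disparity                        # Eq. 1
    world_y = (depth_term * math.sin(tilt)
               + baseline * y * math.cos(tilt)) / disparity   # Eq. 2
    # Assumed sign: minus, treating the tilt as a rotation about the x-axis.
    world_z = (depth_term * math.cos(tilt)
               - baseline * y * math.sin(tilt)) / disparity   # Eq. 3
    return world_x, world_y, world_z
```

With no tilt the expressions reduce to the familiar untilted stereo relations, in which depth is inversely proportional to disparity.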



Eq. 4

$$w = \begin{cases} \dfrac{d - (d_f - d_s)}{d_s} & \text{if } d > (d_f - d_s) \\ 0 & \text{otherwise} \end{cases}$$

where $w$ is the weight, measured 0 to 1
$d$ is the distance of the feature into the hand detection region
$d_f$ is the distance of the feature that is furthest into the hand detection region
$d_s$ is a predefined distance representing the expected size of the hand

Eq. 5

$$S = \begin{cases} S_A & \text{if } (D \le D_A) \\ \alpha S_B + (1 - \alpha) S_A \ \text{ where } \alpha = \dfrac{D - D_A}{D_B - D_A} & \text{if } (D_A < D < D_B) \\ S_B & \text{if } (D \ge D_B) \end{cases}$$

where $D = |r(t) - s(t-1)|$
$s(t)$ is the smoothed value at time $t$
$r(t)$ is the raw value at time $t$
$D_A$ and $D_B$ are thresholds
$S_A$ and $S_B$ define degrees of dampening



Eq. 6

$$a = \frac{e}{S}$$ where $a$ is bound such that $0 \le a \le 1$

where $S$ is dampening found by Eq. 5
$e$ is the elapsed time since the previous sample
$a$ is a scalar



Eq. 7

$$s(t) = (a \times r(t)) + ((1 - a) \times s(t - 1))$$

where $s(t)$ is the smoothed value at time $t$
$r(t)$ is the raw value at time $t$
$a$ is a scalar where $0 \le a \le 1$
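
The smoothing of Eq. 7 is a recursive filter whose blend factor comes from Eq. 6. The sketch below is illustrative: it assumes Eq. 6 takes the form of the elapsed time divided by the dampening, bound to the range 0 to 1, and the class name `Smoother` and its fields are hypothetical.

```python
class Smoother:
    """Recursive smoothing of a scalar value, following the form of Eq. 7."""

    def __init__(self, dampening):
        self.dampening = dampening   # S, assumed to be a positive time constant
        self.value = None            # s(t - 1); None until the first sample

    def update(self, raw, elapsed):
        """Blend the raw sample r(t) with the previous smoothed value."""
        if self.value is None:
            self.value = raw         # nothing to smooth against yet
            return self.value
        # Assumed form of Eq. 6: a = e / S, bound so that 0 <= a <= 1.
        a = min(max(elapsed / self.dampening, 0.0), 1.0)
        # Eq. 7: s(t) = a * r(t) + (1 - a) * s(t - 1)
        self.value = a * raw + (1.0 - a) * self.value
        return self.value
```

A larger dampening gives a smaller blend factor, so the output follows the raw value more slowly and jitter is suppressed at the cost of added lag.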
Eq. 8

$$x = \begin{cases} b_c + \beta (b_l - b_c) & \text{if left arm} \\ b_c + \beta (b_r - b_c) & \text{if right arm} \\ b_c & \text{if unknown} \end{cases}$$

where $x$ is the position of the hand detection region
$b_c$ is the position of the body's center
$b_l$ and $b_r$ are the positions of the left and right bounds of the body
$\beta$ is a scalar representing the amount by which the hand detection region position is biased to the left or right side



Eq. 9

$$x_c = \begin{cases} 0 & \text{if } x_h < b_l \\ \dfrac{x_h - b_l}{b_r - b_l} & \text{if } b_l \le x_h \le b_r \\ 1 & \text{if } x_h > b_r \end{cases}$$

where $x_h$ is the hand position in the world coordinate system
$x_c$ is the cursor position on the screen, mapped 0-1
$b_l$ and $b_r$ are the positions of the left and right bounds of a sub-region within the hand detection region, w.r.t. the world coordinate system
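
The mapping of Eq. 9 amounts to a clamped linear interpolation between the sub-region bounds. A minimal sketch, assuming the linear form for the in-bounds case; the function name `hand_to_cursor` is hypothetical.

```python
def hand_to_cursor(hand_x, left_bound, right_bound):
    """Map a world-coordinate hand position to a cursor position in 0..1."""
    if right_bound == left_bound:
        raise ValueError("sub-region has zero width")
    t = (hand_x - left_bound) / (right_bound - left_bound)
    # Positions outside the sub-region pin the cursor to the screen edge.
    return min(max(t, 0.0), 1.0)
```

Multiplying the result by the screen width in pixels would give a pixel coordinate; the same function applies to the vertical axis with the top and bottom bounds.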



Eq. 10



-X 



m 



-X 



m 



0 



' » , - • 

if b,<x^<[x,-^) 



'c • 2 

if ixc+^)<xj,<b. 



if 



where $x_v$ is the velocity applied in the virtual coordinate system
$x_m$ is the maximum magnitude of velocity that may be applied in the virtual coordinate system
$x_h$ is the position in the world coordinate system
$x_c$ is the neutral position in the world coordinate system
$x_d$ is the width of the "dead zone" in the world coordinate system
$b_l$ and $b_r$ are the positions of the left and right bounds of a sub-region w.r.t. the world coordinate system



A number of implementations have been described. Nevertheless, it will be
understood that various modifications may be made. Accordingly, other implementations
are within the scope of the following claims.

WHAT IS CLAIMED IS:

1. A method of using stereo vision to interface with a computer, the method
comprising:

capturing a stereo image;

processing the stereo image to determine position information of an object
in the stereo image, the object being controlled by a user; and

using the position information to allow the user to interact with a computer
application.



2. The method of claim 1 wherein the step of capturing the stereo image
further includes capturing the stereo image using a stereo camera.

3. The method of claim 1 further including recognizing a gesture associated
with the object by analyzing changes in the position information of the object, and
controlling the computer application based on the recognized gesture.

4. The method of claim 3 further including:

determining an application state of the computer application; and

using the application state in recognizing the gesture.


20 5. . The method, of claim T 

6. The method of claim 1 wherein the object is a part of the user.

7. The method of claim 1 further including providing feedback to the user
relative to the computer application.

8. The method of claim 1 wherein processing the stereo image to determine
position information of the object further includes mapping the position information from
position coordinates associated with the object to screen coordinates associated with the
computer application.

9. The method of claim 1 wherein processing the stereo image further
includes processing the stereo image to identify feature information and produce a scene
description from the feature information.

10. The method of claim 9 further including analyzing the scene description in
a scene analysis process to determine position information of the object.

11. The method of claim 9 wherein processing the stereo image further
includes:

analyzing the scene description to identify a change in position of the
object; and

mapping the change in position of the object.

12. The method of claim 9 wherein processing the stereo image to produce the
scene description further includes:

processing the stereo image to identify matching pairs of features in the
stereo image; and

calculating a disparity and a position for each matching feature pair to
create a scene description.

13. The method of claim 12 wherein:

capturing the stereo image further includes capturing a reference image
from a reference camera and a comparison image from a comparison camera; and

processing the stereo image further includes processing the reference
image and the comparison image to create pairs of features.

14. The method of claim 13 wherein processing the stereo image to identify
matching pairs of features in the stereo image further includes:

identifying features in the reference image;

generating for each feature in the reference image a set of candidate
matching features in the comparison image; and

producing a feature pair by selecting a best matching feature from the set
of candidate matching features for each feature in the reference image.

15. The method of claim 13 wherein processing the stereo image further
includes filtering the reference image and the comparison image.

16. The method of claim 14 wherein producing the feature pair further
includes:

calculating a match score and rank for each of the candidate matching
features; and

selecting the candidate matching feature with the highest match score to
produce the feature pair.

17. The method of claim 14 wherein generating for each feature in the
reference image a set of candidate matching features further includes selecting candidate
matching features from a predefined range in the comparison image.

18. The method of claim 16 wherein feature pairs are eliminated based upon
the match score of the candidate matching feature.

19. The method of claim 18 wherein feature pairs are eliminated if the match
score of the top ranking candidate matching feature is below a predefined threshold.

20. The method of claim 18 wherein the feature pair is eliminated if the match
score of the top ranking candidate matching feature is within a predefined threshold of the
match score of a lower ranking candidate matching feature.
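
The elimination rules of claims 19 and 20 can be pictured as a small selection routine. This sketch is illustrative only: the function name `select_match`, the use of (score, candidate) tuples, and the two threshold parameters are assumptions, and higher scores are taken to mean better matches.

```python
def select_match(candidates, min_score, ambiguity_margin):
    """Pick the best candidate for one reference feature, or None.

    candidates: list of (score, candidate) tuples; higher score = better match.
    min_score: the feature pair is dropped if the best score is below this.
    ambiguity_margin: the pair is dropped if the runner-up scores within
        this margin of the best score (the match is ambiguous).
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda item: item[0], reverse=True)
    best_score, best = ranked[0]
    if best_score < min_score:
        return None
    if len(ranked) > 1 and best_score - ranked[1][0] < ambiguity_margin:
        return None
    return best
```

Returning `None` for a weak or ambiguous match leaves a gap in the scene description rather than admitting a feature pair with an unreliable disparity.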


21. The method of claim 16 wherein calculating the match score further
includes:

identifying those feature pairs that are neighboring;

adjusting the match score of feature pairs in proportion to the match score
of neighboring candidate matching features at similar disparity; and

selecting the candidate matching feature with the highest adjusted match
score to create the feature pair.

22. The method of claim 16 wherein feature pairs are eliminated by:

applying the comparison image as the reference image and the reference
image as the comparison image to produce a second set of feature pairs; and

eliminating those feature pairs in the original set of feature pairs which do
not have a corresponding feature pair in the second set of feature pairs.
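
The cross-check of claim 22 keeps a feature pair only when matching in the opposite direction returns the same pair. A minimal sketch, assuming each set of feature pairs is held as a dictionary from a feature identifier in one image to its match in the other; the function name `cross_check` and this representation are hypothetical.

```python
def cross_check(forward_pairs, reverse_pairs):
    """Keep pairs confirmed by matching in the opposite direction.

    forward_pairs: {reference_feature: comparison_feature}
    reverse_pairs: {comparison_feature: reference_feature}, produced by
        swapping the roles of the two images.
    """
    return {
        ref: comp
        for ref, comp in forward_pairs.items()
        if reverse_pairs.get(comp) == ref
    }
```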



23. The method of claim 12 further comprising:

for each feature pair in the scene description, calculating real world
coordinates by transforming the disparity and position of each feature pair relative to the
real world coordinates of the stereo image.



24. The method of claim 14 wherein selecting features further includes
dividing the reference image and the comparison image of the stereo image into blocks.

25. The method of claim 24 wherein the feature is described by a pattern of
luminance of the pixels contained with the blocks.

26. The method of claim 24 wherein dividing further includes dividing the
images into pixel blocks having a fixed size.



27; ' Tie method of claim 26 wherein the. pixel blocks a^^ 



28. The method of claim 10 wherein analyzing the scene description to
determine the position information of the object further includes cropping the scene
description to exclude feature information lying outside of a region of interest in a field of
view.

29. The method of claim 28 wherein cropping further includes establishing a
boundary of the region of interest.

30. The method of claim 10 wherein analyzing the scene description to
determine the position information of the object further includes:

clustering the feature information in a region of interest into clusters
having a collection of features by comparison to neighboring feature information within a
predefined range; and

calculating a position for each of the clusters.

31. The method of claim 30 further including eliminating those clusters having
less than a predefined threshold of features.

32. The method of claim 30 further including:

10 ' ^selecting the position of the clxisters that mat , 

. recording the position of the clusters th^ 
object position coordinates; and

outputting the object position coordinates.



15 "33; IThe method of claim 30 fintherincludi^ 

user from the clusters by checking features within a presence detection region. . 

34. The method of claim 32 wherein calculating the position for each of the
clusters excludes those features in the clusters that are outside of an object detection
region.

35. The method of claim 32 further including defining a dynamic object
detection region based on the object position coordinates.

36. The method of claim 35 wherein the dynamic object detection region is
defined relative to a user's body.

37. The method of claim 32 further including defining a body position
detection region based on the object position coordinates.

38. The method of claim 37 wherein defining the body position detection
region further includes detecting a head position of the user.




39. The method of claim 32 further including smoothing the
object position coordinates to eliminate jitter between consecutive image frames.

40. The method of claim 32 further including calculating hand orientation
information from the object position coordinates.

41. The method of claim 40 wherein outputting the object position coordinates
further includes outputting the hand orientation information.

42. The method of claim 40 further including smoothing the
hand orientation information.

43. The method of claim 36 wherein defining the dynamic object detection
region includes:

identifying a position of a torso-divisioning plane from the collection of
features; and

determining the position of a hand detection region relative to the torso-
divisioning plane in the axis perpendicular to the torso-divisioning plane.

44. The method of claim 43 further including:

identifying a body center position and a body boundary position from the
collection of features;

identifying a position indicating part of an arm of the user from the
collection of features using the intersection of the feature pair cluster with the torso-
divisioning plane; and

identifying the arm as either a left arm or a right arm using the arm
position relative to the body position.

45. The method of claim 44 further including establishing a shoulder position
from the body center position, the body boundary position, the torso-divisioning plane,
and the left arm or the right arm identification.

46. The method of claim 45 wherein defining the dynamic object detection
region includes determining position data for the hand detection region relative to the
shoulder position.

47. The method of claim 46 further including smoothing the position data for
the hand detection region.

48. The method of claim 45 further including:

determining the position of the dynamic object detection region relative to
the torso-divisioning plane in the axis perpendicular to the torso-divisioning plane;

determining the position of the dynamic object detection region in the
horizontal axis relative to the shoulder position; and

determining the position of the dynamic object detection region in the
vertical axis relative to an overall height of the user using the body boundary position.

49. The method of claim 36 wherein defining the dynamic object detection
region includes:

establishing the position of a top of the user's head using topmost feature
pairs of the collection of features unless the topmost feature pairs are at the boundary; and

determining the position of a hand detection region relative to the top of
the user's head.

50. A method of using stereo vision to interface with a computer, the method
comprising:

capturing a stereo image using a stereo camera;

processing the stereo image to determine position information of an object
in the stereo image, the object being controlled by a user;

processing the stereo image to identify feature information, to produce a
scene description from the feature information, and to identify matching pairs of features
in the stereo image;

calculating a disparity and a position for each matching feature pair to
create the scene description;

analyzing the scene description in a scene analysis process to determine
position information of the object;

clustering the feature information in a region of interest into clusters
having a collection of features by comparison to neighboring feature information within a
predefined range;

calculating a position for each of the clusters; and

using the position information to allow the user to interact with a computer
application.

51. The method of claim 50 further including:

mapping the position of the object from the feature information from
camera coordinates to screen coordinates associated with the computer application; and

using the mapped position to interface with the computer application.



52. The method of claim 50 further including:

recognizing a gesture associated with the object by analyzing changes in
the position information of the object in the scene description; and

combining the position information 
computer application.

53. The method of claim 50 wherein the step of capturing the stereo image
further includes capturing the stereo image using a stereo camera.



54. A stereo vision system for interfacing with an application program running
on a computer, the stereo vision system comprising:

first and second video cameras arranged in an adjacent configuration and
operable to produce a series of stereo video images; and

a processor operable to receive the series of stereo video images and detect
objects appearing in an intersecting field of view of the cameras, the processor executing
a process to:

define an object detection region in three-dimensional coordinates
relative to a position of the first and second video cameras;

select a control object appearing within the object detection region;
and

map position coordinates of the control object to a position
indicator associated with the application program as the control object moves within the
object detection region.

55. The stereo vision system of claim 54 wherein the processor selects as a
control object a detected object appearing closest to the video cameras and within the
object detection region.

56. The stereo vision system of claim 54 wherein the control object is a human
hand.

57. The stereo vision system of claim 54 wherein a horizontal position of the
control object relative to the video cameras is mapped to an x-axis screen coordinate of
the position indicator.

58. The stereo vision system of claim 54 wherein a vertical position of the
control object relative to the video cameras is mapped to a y-axis screen coordinate of the
position indicator.

59. The stereo vision system of claim 54 wherein the processor is configured
to:

map a horizontal position of the control object relative to the video
cameras to an x-axis screen coordinate of the position indicator;

map a vertical position of the control object relative to the video cameras
to a y-axis screen coordinate of the position indicator; and

emulate a mouse function using the combined x-axis and y-axis screen
coordinates provided to the application program.

60. The stereo vision system of claim 59 wherein the processor is further
configured to emulate buttons of a mouse using gestures derived from the motion of the
object position.

61. The stereo vision system of claim 59 wherein the processor is further
configured to emulate buttons of a mouse based upon a sustained position of the control
object in any position within the object detection region for a predetermined time period.

62. The stereo vision system of claim 59 wherein the processor is further
configured to emulate buttons of a mouse based upon a position of the position indicator
being sustained within the bounds of an interactive display region for a predetermined
time period.
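
The sustained-position behaviour of claims 61 and 62 can be sketched as a dwell timer. The following is illustrative only: the class name `DwellDetector`, the movement tolerance `radius`, and the use of caller-supplied timestamps are assumptions rather than details given in the claims.

```python
import math

class DwellDetector:
    """Report a click when a position is held still for long enough."""

    def __init__(self, radius, hold_time):
        self.radius = radius         # how far the position may drift (assumed)
        self.hold_time = hold_time   # the predetermined time period, in seconds
        self.anchor = None           # where the current dwell started
        self.start = None            # when the current dwell started
        self.fired = False           # avoid repeating the click while held

    def update(self, position, timestamp):
        """Feed one (x, y) sample; return True once per completed dwell."""
        if self.anchor is None or math.dist(position, self.anchor) > self.radius:
            # The position moved away: restart the dwell from here.
            self.anchor, self.start, self.fired = position, timestamp, False
            return False
        if not self.fired and timestamp - self.start >= self.hold_time:
            self.fired = True
            return True
        return False
```

Firing only once per dwell means a hand resting in place produces a single emulated button press rather than a stream of them.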

63. The stereo vision system of claim 54 wherein the processor is further
configured to map a z-axis depth position of the control object relative to the video
cameras to a virtual z-axis screen coordinate of the position indicator.

64. The stereo vision system of claim 54 wherein the processor is further
configured to:

map an x-axis position of the control object relative to the video cameras to
an x-axis screen coordinate of the position indicator;

map a y-axis position of the control object relative to the video cameras to
a y-axis screen coordinate of the position indicator; and

map a z-axis depth position of the control object relative to the video
cameras to a virtual z-axis screen coordinate of the position indicator.

65. The stereo vision system of claim 64 wherein a position of the position
indicator being within the bounds of an interactive display region triggers an action
within the application program.

66. The stereo vision system of claim 54 wherein movement of the control
object along a z-axis depth position that covers a predetermined distance within a
predetermined time period triggers a selection action within the application program.

67. The stereo vision system of claim 54 wherein a position of the control
object being sustained in any position within the object detection region for a
predetermined time period triggers a selection action within the application program.

68. A stereo vision system for interfacing with an application program running
on a computer, the stereo vision system comprising:

first and second video cameras arranged in an adjacent configuration and
operable to produce a series of stereo video images; and

a processor operable to receive the series of stereo video images and detect
objects appearing in the intersecting field of view of the cameras, the processor executing
a process to:

define an object detection region in three-dimensional coordinates
relative to a position of the first and second video cameras;

select as a control object a detected object appearing closest to the
video cameras and within the object detection region;

define sub regions within the object detection region;

identify a sub region occupied by the control object;

associate with that sub region an action that is activated when the
control object occupies that sub region; and

apply the action to interface with a computer application.

69. The stereo vision system of claim 68 wherein the action associated with
the sub region is further defined to be an emulation of the activation of keys associated
with a computer keyboard.

70. The stereo vision system of claim 68 wherein a position of the control
object being sustained in any sub region for a predetermined time period triggers the
action.

71. A stereo vision system for interfacing with an application program running
on a computer, the stereo vision system comprising:

first and second video cameras arranged in an adjacent configuration and
operable to produce a series of stereo video images; and

a processor operable to receive the series of stereo video images and detect
objects appearing in an intersecting field of view of the cameras, the processor executing
a process to:

identify an object perceived as the largest object appearing in the
intersecting field of view of the cameras and positioned at a predetermined depth range;

select the object as an object of interest;

determine a position coordinate representing a position of the
object of interest; and

use the position coordinate as an object control point to control the
application program.

72. The system of claim 71 wherein the process causes the processor to:

determine and store a neutral control point position;

map a coordinate of the object control point relative to the neutral control
point position; and

use the mapped object control point coordinate to control the application
program.

73. The system of claim 72 wherein the process causes the processor to:

define a region having a position
control point position;

map the object control point relative to its position within the region; and

use the mapped object control point coordinate to control the application
program.



74. The system of claim 72 wherein the process causes the processor to:

transform the mapped object control point to a velocity function;

determine a viewpoint associated with a virtual environment of the
application program; and

use the velocity function to move the viewpoint within the virtual
environment.

75. The system of claim 71 wherein the process causes the processor to map a
coordinate of the object control point to control a position of an indicator within the
application program.



76. The system of claim 75 wherein the indicator is an avatar.

77. The system of claim 71 wherein the process causes the processor to map a
coordinate of the object control point to control an appearance of an indicator within the
application program.

78. The system of claim 77 wherein the indicator is an avatar.

79. The system of claim 71 wherein the object of interest is a human appearing
within the intersecting field of view.

80. A stereo vision system for interfacing with an application program running
on a computer, the stereo vision system comprising:

first and second video cameras arranged in an adjacent configuration and
operable to produce a series of stereo video images; and

a processor operable to receive the series of stereo video images and detect
objects appearing in an intersecting field of view of the cameras, the processor executing
a process to:

identify an object perceived as the largest object appearing in the
intersecting field of view of the cameras and positioned at a predetermined depth range;

select the object as an object of interest;

define a control region between the cameras and the object of
interest, the control region being positioned at a predetermined location and having a
predetermined size relative to a size and a location of the object of interest;

search the control region for a point associated with the object of
interest that is closest to the cameras and within the control region;

select the point associated with the object of interest as a control
point if the point associated with the object of interest is within the control region; and

map position coordinates of the control point, as the control point
moves within the control region, to a position indicator associated with the application
program.

81. The system of claim 80 wherein the processor is operable to:

map a horizontal position of the control point relative to the video cameras
to an x-axis screen coordinate of the position indicator;

map a vertical position of the control point relative to the video cameras to
a y-axis screen coordinate of the position indicator; and

emulate a mouse function using a combination of the x-axis and the y-axis
screen coordinates.

82. The system of claim 80 wherein the processor is operable to:

map an x-axis position of the control point relative to the video cameras to
an x-axis screen coordinate of the position indicator;

map a y-axis position of the control point relative to the video cameras to a
y-axis screen coordinate of the position indicator; and

map a z-axis depth position of the control point relative to the video
cameras to a virtual z-axis screen coordinate of the position indicator.


. 83. ^ , T^ 
15 - wthin the intersecting field of view. 

84. The system of claim 80 wherein the control point is associated with a
human hand appearing within the control region.

20 . ' ,85. ; A Stereo vision system for interfacing with an application program running 

., , " ■ ' • . V • - • ' 

on a computer, the stereo; vision system con^^ ^ 

first and secorid video cameras ananged in an adjacent configuration and 
operable to produce a series of stereo video images; and ' 

a processor operable to receive the series of stereo video images and detect
objects appearing in an intersecting field of view of the cameras, the processor executing
a process to:

define an object detection region in three-dimensional coordinates
relative to a position of the first and second video cameras;

select up to two hand objects from the objects appearing in the
intersecting field of view that are within the object detection region; and

map position coordinates of the hand objects, as the hand objects
move within the object detection region, to positions of virtual hands associated with an
avatar rendered by the application program.

86. The system of claim 85 wherein the process selects the up to two hand
objects from the objects appearing in the intersecting field of view that are closest to the
video cameras and within the object detection region. 
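
Selecting up to two hand objects that are closest to the cameras and inside the object detection region can be sketched as follows. The sketch assumes each detected object is summarized by a single three-dimensional point with z as distance from the cameras, and that the detection region is a box given by its minimum and maximum corners. These representations, and the names `in_region` and `select_hand_objects`, are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z); z assumed to be distance from the cameras
Box = Tuple[Point, Point]           # (min corner, max corner) of the detection region


def in_region(p: Point, region: Box) -> bool:
    """True when p lies inside the box on all three axes."""
    lo, hi = region
    return all(lo[i] <= p[i] <= hi[i] for i in range(3))


def select_hand_objects(objects: Sequence[Point], region: Box,
                        max_hands: int = 2) -> List[Point]:
    """Return at most max_hands objects inside the region, nearest first."""
    candidates = [p for p in objects if in_region(p, region)]
    return heapq.nsmallest(max_hands, candidates, key=lambda p: p[2])
```

The result may hold zero, one, or two entries, matching the "up to two" wording of the claim.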

87. The system of claim 85 wherein the avatar takes the form of a human-like
body.

88. The system of claim 85 wherein the avatar is rendered in and interacts with
a virtual environment forming part of the application program.

89. The system of claim 88 wherein the processor further executes a process to
compare the positions of the virtual hands associated with the avatar to positions of
virtual objects within the virtual environment to enable a user to interact with the virtual
objects within the virtual environment.
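
One simple way to compare virtual hand positions with virtual object positions is a distance threshold, sketched below. The claim does not specify the comparison, so the threshold test, the `reach` parameter, and the name `touched_objects` are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def touched_objects(hands: Sequence[Point],
                    objects: Dict[str, Point],
                    reach: float = 0.1) -> List[str]:
    """Return the names of virtual objects lying within `reach` of any
    virtual hand, in alphabetical order."""
    return sorted(
        name for name, pos in objects.items()
        if any(math.dist(hand, pos) <= reach for hand in hands)
    )
```

An application program could use the returned names to trigger whatever interaction it associates with each virtual object.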

90. The system of claim 85 wherein the processor further executes a process
to:

detect position coordinates of a user within the intersecting field of view; and

map the position coordinates of the user to a virtual torso of the avatar
rendered by the application program.

91. The system of claim 85 wherein the process moves at least one of the
virtual hands associated with the avatar to a neutral position if a corresponding hand 
object is not selected. 
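
The fallback to a neutral position can be sketched as a per-hand default. The left and right labels, the rest-pose coordinates in `NEUTRAL`, and the function name are hypothetical. The claim only requires that a virtual hand move to a neutral position when no corresponding hand object is selected.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float, float]

# Hypothetical rest pose for each virtual hand, in avatar coordinates.
NEUTRAL: Dict[str, Point] = {
    "left": (-0.2, -0.4, 0.0),
    "right": (0.2, -0.4, 0.0),
}


def virtual_hand_positions(selected: Dict[str, Optional[Point]],
                           neutral: Dict[str, Point] = NEUTRAL) -> Dict[str, Point]:
    """Use the mapped position for each selected hand object and the
    neutral position for any hand that has no selected hand object."""
    result: Dict[str, Point] = {}
    for side, rest in neutral.items():
        pos = selected.get(side)
        result[side] = rest if pos is None else pos
    return result
```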



92. The system of claim 85 wherein the processor further executes a process
to:

detect position coordinates of a user within the intersecting field of view;
and

map the position coordinates of the user to a velocity
applied to the avatar to enable the avatar to roam through a virtual environment rendered
by the application program.

93.

position denoting zero velocity of the avatar.
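
A mapping from user position to avatar velocity, with a position that yields zero velocity, might look like the sketch below. The proportional form, the `gain` and `dead_zone` parameters, and the use of floor-plane (x, z) coordinates are assumptions made for illustration and are not specified in the claims.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]  # (x, z) on the floor plane, an assumed convention


def position_to_velocity(user_xz: Vec2, neutral_xz: Vec2,
                         gain: float = 1.0, dead_zone: float = 0.05) -> Vec2:
    """Map the user's offset from the neutral position to an avatar velocity.

    Offsets no larger than dead_zone give zero velocity, so the avatar
    stands still while the user remains at or near the neutral position.
    """
    dx = user_xz[0] - neutral_xz[0]
    dz = user_xz[1] - neutral_xz[1]
    if math.hypot(dx, dz) <= dead_zone:
        return (0.0, 0.0)
    return (gain * dx, gain * dz)
```

The small dead zone is a design choice that keeps sensor jitter around the neutral position from making the avatar drift.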

94. The system of claim 93 wherein the processor further executes a process to

coordinates associated with the avatar so that the avatar appears to lean.



95. The system of claim 92 wherein the processor further executes a process
to compare the position of the virtual hands associated with the avatar to positions of
virtual objects within the virtual environment to enable the user to interact with the virtual
objects while roaming through the virtual environment.

96. The system of claim 85 wherein a virtual knee position associated with the
avatar is derived by the application program and used to refine an appearance of the
avatar.

97. The system of claim 85 wherein a virtual elbow position associated with
the avatar is derived by the application program and used to refine an appearance of the
avatar.



98.

in an adjacent configuration with the first and second video cameras and operable to
produce the series of stereo video images.
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